TTP: A Hardware-Efficient Design for Precise Prefetching in Ray Tracing

Anshul Naithani; Huiyang Zhou; Yavuz Selim Tozlu

arxiv: 2605.16253 · v1 · pith:5WBQNOLBnew · submitted 2026-05-15 · 💻 cs.AR

Pith reviewed 2026-05-19 18:11 UTC · model grok-4.3

Classification: cs.AR

Keywords: ray tracing, prefetching, BVH traversal, hardware prefetcher, GPU architecture, memory latency, traversal stack

The pith

A prefetcher that monitors consecutive pops from ray tracing traversal stacks delivers a 1.48x average speedup in cycle-level simulation with negligible hardware overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a hardware prefetcher for ray tracing called TTP that exploits the traversal stack already built into dedicated RT units. By generating prefetch requests for BVH nodes when addresses are popped consecutively from the stack during depth-first traversal, the design anticipates upward movements through the tree. This yields 1.48 times average speedup and up to 1.89 times in tested cases while adding almost no extra circuitry. The prefetches prove highly accurate at 98.92 percent on average, with 31.54 percent coverage of baseline misses.

Core claim

TTP prefetches nodes using the addresses already available on the hardware traversal stacks of each thread. For DFS-based traversal, prefetches are generated when nodes are being popped consecutively from the traversal stack, potentially corresponding to upward traversal through the tree. Evaluated on the cycle-level simulator Vulkan-sim 2.0, TTP achieves 1.48x speedup on average compared to the baseline with nearly negligible hardware overhead, 98.92 percent average L1 accuracy, and 31.54 percent coverage.

What carries the argument

Tree Traversal Prefetcher (TTP), which issues prefetch requests from addresses on the existing hardware traversal stack during consecutive pops in depth-first BVH traversal.
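
To make the trigger concrete, here is a minimal Python sketch of a per-thread traversal stack that issues a prefetch on the second and later pop in a row. It is an illustration under stated assumptions, not the paper's design: the class `TraversalStack`, the `prefetch_depth` knob, the toy tree, and the choice to request the entry left on top of the stack are all hypothetical. The actual trigger logic is the state machine shown in the paper's Figure 8.

```python
from dataclasses import dataclass, field


@dataclass
class TraversalStack:
    """Per-thread stack of node addresses with a consecutive-pop trigger (hypothetical)."""

    prefetch_depth: int = 1  # hypothetical knob: stack entries requested per trigger
    entries: list = field(default_factory=list)
    prefetches: list = field(default_factory=list)
    last_op_was_pop: bool = False

    def push(self, addr):
        self.entries.append(addr)
        self.last_op_was_pop = False  # a push ends the current pop streak

    def pop(self):
        addr = self.entries.pop()
        if self.last_op_was_pop:
            # Second or later pop in a row: request the entries now nearest
            # the top of the stack, since they are the next pop candidates.
            start = max(0, len(self.entries) - self.prefetch_depth)
            for candidate in reversed(self.entries[start:]):
                if candidate not in self.prefetches:
                    self.prefetches.append(candidate)
        self.last_op_was_pop = True
        return addr


def traverse(tree, root, stack):
    """Depth-first traversal of a toy tree; returns nodes in fetch order."""
    order = []
    stack.push(root)
    while stack.entries:
        node = stack.pop()  # each node fetch corresponds to one pop
        order.append(node)
        for child in tree.get(node, []):
            stack.push(child)  # assumption: the ray hits every child box
    return order


# Toy BVH: A is the root; D, E, F, G are leaves.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
stack = TraversalStack()
order = traverse(tree, "A", stack)
print(order)             # fetch order of the nodes
print(stack.prefetches)  # addresses requested by the trigger
```

In this toy trace the fetch order is A, C, G, F, B, E, D and a single prefetch is issued, for node B, at the moment F is popped directly after G. A push resets the streak, so downward steps never trigger a request.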

If this is right

Ray tracing workloads experience fewer memory stalls during BVH traversal, increasing overall throughput.
High prefetch accuracy keeps cache pollution low and avoids wasting bandwidth on unused data.
The design integrates into existing RT cores because it reuses already-present stack hardware.
Speedup scales with scene size and complexity where memory latency dominates computation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Stack-based monitoring may extend to prefetching in other tree-traversal workloads such as spatial databases or physics simulations.
Future hardware revisions could expose additional traversal state to enable even more targeted prefetch decisions.
The approach suggests that domain-specific knowledge of internal hardware structures can outperform generic prefetch algorithms.

Load-bearing premise

The cycle-level simulator accurately reproduces the memory access patterns and traversal stack behavior of real ray tracing hardware, and consecutive stack pops reliably indicate useful upward traversal without causing cache pollution.

What would settle it

Measuring prefetch accuracy and overall speedup when the same TTP logic is placed in real ray tracing silicon and run on physical GPUs instead of a cycle-level simulator.

Figures

Figures reproduced from arXiv: 2605.16253 by Anshul Naithani, Huiyang Zhou, Yavuz Selim Tozlu.

**Figure 2.** DRAM activity with and without (i.e., baseline) TTP.

**Figure 3.** Simplified BVH structure for Stanford Bunny. The green arrow…

**Figure 4.** Diagram of the GPU model used in this study. Purple blocks indicate the modified components. Redrawn from [39].

**Figure 6.** Analysis of pop streaks. Accompanying text: “…rays (or threads) have fetched them, and capacity or conflict misses if they were fetched by other threads but evicted from the caches later on. To see the impact of such traversal stack pop streaks, we extract traversal stack activity from the simulator. Note that during BVH traversal, every memory read request is a pop from a traversal stack. Therefore, we categorize the cache mi…” and “In addition, we perform a limit study by simulating with perfect upward and perfect downward traversals. For perfect upward traversal, 2nd and later pops after a push always hit in the L1 cache, and for perfect downward traversal, 1st pops after a push always hit in the L1 cache.”

**Figure 5.** Example DFS BVH traversal and the corresponding traversal stack.

**Figure 7.** Percentage of RT read misses where the node was in the traversal…

**Figure 8.** State machine that generates prefetches.

**Figure 9.** TTP implementation. Top of the stack is denoted by…

**Figure 10.** TTP speedup (higher the better), power and energy (lower the better) with DFS traversal, normalized to baseline.

**Figure 12.** Normalized speedups with larger L1 data caches.

**Figure 13.** RT read MPKI with TTP normalized to baseline. Lower is better.

**Figure 11.** Normalized speedups with perfect upward and downward traversal.

**Figure 14.** Prefetcher accuracy and coverage with DFS traversal. Higher is…

**Figure 15.** L1 and L2 cache responses to prefetch requests. For each scene, first…

**Figure 19.** Speedups for different prefetch intensities, normalized to baseline.

**Figure 16.** Normalized speedup, power and energy at 256x256 resolution.

**Figure 21.** RT read MPKI with TTP normalized to baseline, with BFS…

**Figure 22.** BFS and DFS speedups normalized to DFS without TTP.

**Figure 25.** Total DRAM reads for Treelet prefetcher normalized to baseline.

**Figure 26.** Normalized speedups with Park et al.’s prefetching strategy.
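
The limit study quoted under Figure 6 distinguishes the first pop after a push from the second and later pops. A minimal sketch of that bookkeeping follows, assuming a stack trace is available as a plain list of `"push"` and `"pop"` strings. The trace format and both function names are hypothetical and are not the simulator's interface.

```python
def pop_streak_positions(ops):
    """For each pop in `ops`, return its 1-based position in the current pop streak."""
    positions = []
    streak = 0
    for op in ops:
        if op == "push":
            streak = 0  # any push ends the streak
        elif op == "pop":
            streak += 1
            positions.append(streak)
        else:
            raise ValueError(f"unknown stack operation: {op!r}")
    return positions


def count_first_and_later_pops(ops):
    """Count first pops after a push versus second-or-later pops."""
    positions = pop_streak_positions(ops)
    first = sum(1 for p in positions if p == 1)
    return first, len(positions) - first


# Hypothetical trace of one ray walking a small tree depth-first.
trace = [
    "push", "pop",          # root
    "push", "push", "pop",  # two children pushed, one popped
    "push", "push", "pop",  # two grandchildren pushed, one popped
    "pop", "pop",           # pop streak continues
    "push", "push", "pop",
    "pop",
]
print(pop_streak_positions(trace))
print(count_first_and_later_pops(trace))
```

For this trace the pops fall at streak positions 1, 1, 1, 2, 3, 1, 2, so four pops are first pops after a push and three are second or later pops. In the terms of the quoted limit study, the first group is what perfect downward traversal would turn into L1 hits and the second group is what perfect upward traversal would.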

Original abstract

Ray tracing (RT) is a 3D graphics technique that offers highly realistic visuals. It is becoming prominent and accessible as GPU vendors have integrated dedicated ray tracing acceleration hardware. However, tracing millions of rays through 3D scenes consisting of high numbers of triangles in real time is challenging and requires expensive hardware. The main bottleneck in RT workloads is the expensive Bounding Volume Hierarchy (BVH) traversal task, which is a large tree structure that encodes the 3D scene. BVH traversal is a memory-bound problem, as the GPU threads spend most of their time reading tree node data from memory. In this work, we attack the memory latency bottleneck of ray tracing through prefetching. We propose a novel hardware prefetcher, named Tree Traversal Prefetcher (TTP), for ray tracing. The main idea is to leverage the existing tree traversal stack in the RT units for highly accurate prefetching. In particular, TTP prefetches nodes using the addresses already available on the hardware traversal stacks of each thread. For DFS (Depth-first search) based traversal, prefetches are generated when nodes are being popped consecutively from the traversal stack, potentially corresponding to upward traversal through the tree. We evaluate TTP on a cycle-level simulator, Vulkan-sim 2.0, and show that it achieves 1.48x speedup on average (up to 1.89x) compared to the baseline, with nearly negligible hardware overhead. TTP achieves 98.92% average L1 accuracy, which is the ratio of the prefetched blocks being actually referenced by demand loads. The coverage, computed as the ratio of L1 miss reduction over baseline L1 misses, is 31.54%, correlating well with the achieved speedup.
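
The two metrics defined in the abstract can be written down directly. The sketch below is a simplified reading: it assumes accuracy is counted over distinct block addresses and ignores evictions and timeliness. The function names and the sample counts are hypothetical and are not taken from the paper.

```python
def prefetch_accuracy(prefetched_blocks, demanded_blocks):
    """Fraction of prefetched blocks that are also referenced by demand loads."""
    prefetched = set(prefetched_blocks)
    if not prefetched:
        return 0.0
    return len(prefetched & set(demanded_blocks)) / len(prefetched)


def prefetch_coverage(baseline_misses, misses_with_prefetch):
    """L1 miss reduction divided by baseline L1 misses."""
    if baseline_misses == 0:
        return 0.0
    return (baseline_misses - misses_with_prefetch) / baseline_misses


# Hypothetical counts, chosen only to exercise the formulas.
accuracy = prefetch_accuracy(
    {0x100, 0x140, 0x180, 0x1C0},  # blocks brought in by the prefetcher
    {0x100, 0x140, 0x180, 0x200},  # blocks touched by demand loads
)
coverage = prefetch_coverage(baseline_misses=1000, misses_with_prefetch=700)
print(f"accuracy={accuracy:.2%} coverage={coverage:.2%}")
```

With these made-up inputs, three of the four prefetched blocks are used (75 percent accuracy) and 300 of 1000 baseline misses disappear (30 percent coverage). The two numbers are independent: a prefetcher can be highly accurate yet cover only a fraction of the misses, which is the regime the abstract reports.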

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TTP provides a hardware-efficient prefetcher for ray tracing by using consecutive traversal stack pops, but its reported gains depend heavily on the accuracy of the cycle-level simulator.

The key points: this paper describes a prefetcher called TTP that triggers on consecutive pops from the ray tracing traversal stack to prefetch BVH nodes, and it reports a 1.48x average speedup in simulation with very high accuracy and tiny hardware cost. The new part is the targeted use of stack state already present in the hardware to generate these prefetches during DFS traversal. It is not just another general-purpose prefetcher but one that exploits the specific upward traversal pattern indicated by stack pops.

The paper does well to show that this leads to 98.92% L1 accuracy and 31.54% coverage, which correlates with the performance improvement. Keeping overhead negligible is also a plus for a hardware design.

The soft spots are mainly in the evaluation. All claims depend on Vulkan-sim 2.0 accurately modeling the traversal stack and memory behavior. The concern about whether consecutive pops reliably predict useful prefetches without pollution is reasonable, and the lack of hardware validation or sensitivity tests leaves some uncertainty about real-world results.

This paper is for hardware designers focused on ray tracing accelerators; readers who work on GPU memory optimizations or graphics workloads would also find the mechanism and results useful. It deserves a serious referee given the practical focus and quantified benefits. I recommend sending it for peer review: the idea is worth examining even if the simulation needs more scrutiny from reviewers.

Referee Report

2 major / 2 minor

Summary. The paper proposes Tree Traversal Prefetcher (TTP), a hardware prefetcher for ray tracing that leverages addresses already present on the traversal stacks of RT units. Prefetches are triggered on consecutive pops from the stack (assumed to mark useful upward DFS traversal). Evaluated in cycle-level simulation on Vulkan-sim 2.0, TTP reports 1.48× average speedup (max 1.89×), 98.92% L1 accuracy, 31.54% coverage, and negligible hardware overhead relative to a baseline without prefetching.

Significance. If the simulation results hold, the work demonstrates a low-cost, high-accuracy prefetching technique that exploits an existing RT hardware structure (the traversal stack) rather than adding new prediction tables. The reported correlation between coverage and speedup, together with the near-negligible area cost, would be a useful contribution to memory-latency mitigation in dedicated ray-tracing hardware.

major comments (2)

[Evaluation] Evaluation section: all speedup, accuracy, and coverage numbers are obtained from Vulkan-sim 2.0 runs; the manuscript provides no cross-validation against real RT cores, alternative simulators, or sensitivity analysis to stack-pop timing and memory latency parameters. Because the prefetch logic is triggered specifically by consecutive stack pops, any mismatch in modeled stack behavior directly undermines the 1.48× speedup and 98.92% accuracy claims.
[Evaluation] The coverage metric (31.54%) is defined as L1-miss reduction relative to baseline, yet the text does not quantify how much of the observed speedup is attributable to latency hiding versus other simulator effects (e.g., changes in thread scheduling). This link is load-bearing for the central performance claim.
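
As a rough sanity check on the link between coverage and speedup, consider a deliberately crude model. It is an assumption of this note, not something the paper states: execution time is taken to be proportional to the number of L1 misses, and every removed miss is taken to save the same amount of time. With coverage c = 0.3154 and the reported average speedup of 1.48, the arithmetic is:

```latex
S_{\text{model}} = \frac{1}{1 - c} = \frac{1}{1 - 0.3154} \approx 1.46,
\qquad
1 - \frac{1}{S_{\text{reported}}} = 1 - \frac{1}{1.48} \approx 0.324 .
```

The model lands close to the reported 1.48×, and inverting the reported speedup gives a time reduction of about 32.4 percent, near the 31.54 percent coverage. That is consistent with latency hiding being the dominant effect, but a match under such a simple model cannot rule out scheduling effects, so a stall-cycle breakdown would still be needed to settle the attribution.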

minor comments (2)

[Abstract] The abstract states 'nearly negligible hardware overhead' without quoting concrete area or power numbers; these figures should appear in the main text or a table.
[Abstract] Benchmark scenes, triangle counts, and exact baseline configuration (cache sizes, memory latency, etc.) are not summarized in the abstract; a short table or sentence would improve readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment on the evaluation methodology below, providing clarifications and committing to revisions where appropriate to strengthen the paper.

Referee: [Evaluation] Evaluation section: all speedup, accuracy, and coverage numbers are obtained from Vulkan-sim 2.0 runs; the manuscript provides no cross-validation against real RT cores, alternative simulators, or sensitivity analysis to stack-pop timing and memory latency parameters. Because the prefetch logic is triggered specifically by consecutive stack pops, any mismatch in modeled stack behavior directly undermines the 1.48× speedup and 98.92% accuracy claims.

Authors: We agree that additional validation would be valuable. Cross-validation on real RT cores is not feasible in this work, as commercial ray-tracing hardware implementations and their detailed microarchitectural models are proprietary. Vulkan-sim 2.0 is a cycle-level simulator specifically designed for Vulkan and ray-tracing workloads and is commonly used in the literature for such studies. To address the concern about sensitivity to modeling assumptions, we will add a new subsection in the revised manuscript presenting sensitivity analysis on memory latency parameters and variations in the timing of consecutive stack pops.
Revision: partial

Referee: [Evaluation] The coverage metric (31.54%) is defined as L1-miss reduction relative to baseline, yet the text does not quantify how much of the observed speedup is attributable to latency hiding versus other simulator effects (e.g., changes in thread scheduling). This link is load-bearing for the central performance claim.

Authors: We acknowledge that a more explicit breakdown would improve the clarity of the performance results. The reported correlation between coverage and speedup, combined with the very high L1 accuracy, indicates that the gains stem primarily from reduced memory latency. In the revision we will add execution-time breakdowns and memory-stall-cycle statistics to better quantify the contribution of latency hiding versus other simulator effects such as scheduling.
Revision: yes

standing simulated objections not resolved

Direct validation against real commercial RT cores or alternative closed-source simulators remains unaddressed, as we lack access to proprietary hardware models.

Circularity Check

0 steps flagged

No circularity: TTP design evaluated via independent simulation measurements

The paper proposes TTP as a hardware prefetcher that triggers on consecutive pops from the existing RT traversal stack to prefetch BVH nodes during upward DFS traversal. All reported results (1.48x average speedup, 98.92% L1 accuracy defined as prefetched blocks later referenced, 31.54% coverage as miss reduction ratio) are measured outcomes from separate cycle-level simulation runs on Vulkan-sim 2.0. These quantities are not algebraically or definitionally forced by the prefetch rule itself; they are empirical outputs. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes smuggled via prior work appear in the derivation. The design is self-contained; simulator fidelity is an external validity assumption, not an internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on standard domain assumptions about DFS traversal in ray tracing hardware and uses simulation for validation rather than introducing new fitted parameters or entities.

axioms (1)

Domain assumption: Ray tracing hardware performs DFS-based BVH traversal and maintains a per-thread traversal stack of node addresses.
This assumption is used to justify prefetch generation when nodes are popped consecutively from the stack.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: “We evaluate TTP on a cycle-level simulator, Vulkan-sim 2.0, and show that it achieves 1.48x speedup”

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

[1] “Code repo for Treelet Prefetching For Ray Tracing (MICRO 2023).” [Online]. Available: https://github.com/ubc-aamodt-group/treelet-prefetching-for-rt
[2] “DirectX Raytracing (DXR) Functional Spec.” [Online]. Available: https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html
[3] “Intel Embree.” [Online]. Available: https://www.embree.org/
[4] “Intel® Arc™ Graphics Developer Guide for Real-Time Ray Tracing in...” [Online]. Available: https://www.intel.com/content/www/us/en/developer/articles/guide/real-time-ray-tracing-in-games.html
[5] “NVIDIA ADA GPU ARCHITECTURE.” [Online]. Available: https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf
[6] “NVIDIA AMPERE GA102 GPU ARCHITECTURE.” [Online]. Available: https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
[7] “Real-Time Ray Tracing.” [Online]. Available: https://dev.epicgames.com/documentation/en-us/unreal-engine/hardware-ray-tracing-tips-and-tricks-in-unreal-engine
[8] “Real-time Raytracing for Interactive Global Illumination Workflows in Frostbite.” [Online]. Available: https://www.gdcvault.com/play/1024801/
[9] “The Stanford 3D Scanning Repository.” [Online]. Available: https://graphics.stanford.edu/data/3Dscanrep/
[10] “Unity real-time Ray Tracing.” [Online]. Available: https://unity.com/ray-tracing
[11] T. Aila and T. Karras, “Architecture considerations for tracing incoherent rays,” in Proceedings of the Conference on High Performance Graphics, ser. HPG ’10. Goslar, DEU: Eurographics Association, Jun. 2010, pp. 113–122.
[12] T. Aila and S. Laine, “Understanding the efficiency of ray traversal on GPUs,” in Proceedings of the Conference on High Performance Graphics 2009, ser. HPG ’09. New York, NY, USA: Association for Computing Machinery, Aug. 2009, pp. 145–149. [Online]. Available: https://doi.org/10.1145/1572769.1572792
[13] S. Ainsworth and T. M. Jones, “Graph Prefetching Using Data Structure Knowledge,” in Proceedings of the 2016 International Conference on Supercomputing, ser. ICS ’16. New York, NY, USA: Association for Computing Machinery, Jun. 2016, pp. 1–11. [Online]. Available: https://doi.org/10.1145/2925426.2926254
[14] A. Barnes, F. Shen, and T. G. Rogers, “Extending GPU Ray-Tracing Units for Hierarchical Search Acceleration,” in Proceedings of the 57th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’24. New York, NY, USA: Association for Computing Machinery, Nov. 2024.
[15] A. Buluç and K. Madduri, “Parallel breadth-first search on distributed memory systems,” in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’11. New York, NY, USA: Association for Computing Machinery, Nov. 2011, pp. 1–12. [Online]. Available: https://dl.acm.org/doi/10.1145/2063384.2063471
[16] J. Burgess, “RTX on—The NVIDIA Turing GPU,” IEEE Micro, vol. 40, no. 2, pp. 36–44, Mar. 2020. [Online]. Available: https://ieeexplore.ieee.org/document/8981896
[17] B. Caulfield, “What’s the Difference Between Ray Tracing and Rasterization?” Mar. 2018. [Online]. Available: https://blogs.nvidia.com/blog/whats-difference-between-ray-tracing-rasterization/
[18] Y. H. Chou and T. M. Aamodt, “Treelet Accelerated Ray Tracing on GPUs,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’25. New York, NY, USA: Association for Computing Machinery, Mar. 2025, pp. 1334–1347. [Online]. Available: https://dl.acm.org/doi/1... (doi:10.1145/3676641.3716279)
[19] Y. H. Chou, T. Nowicki, and T. M. Aamodt, “Treelet Prefetching For Ray Tracing,” in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY, USA: Association for Computing Machinery, Dec. 2023, pp. 742–755.
[20] P. H. Christensen, J. Fong, D. M. Laur, and D. Batali, “Ray Tracing for the Movie ‘Cars’,” in 2006 IEEE Symposium on Interactive Ray Tracing, Sep. 2006, pp. 1–6. [Online]. Available: https://ieeexplore.ieee.org/document/4061539
[21] Y. Deng, Y. Ni, Z. Li, S. Mu, and W. Zhang, “Toward Real-Time Ray Tracing: A Survey on Hardware Acceleration and Microarchitecture Techniques,” ACM Computing Surveys, vol. 50, no. 4, pp. 58:1–58:41, Aug. 2017. [Online]. Available: https://doi.org/10.1145/3104067
[22] Y. Feng, Y. Li, J. Lee, W. W. Ro, and H. Jeon, “Heliostat: Harnessing Ray Tracing Accelerators for Page Table Walks,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, ser. ISCA ’25. New York, NY, USA: Association for Computing Machinery, Jun. 2025, pp. 122–136. [Online]. Available: https://dl.acm.org/doi/10.1145/36950...
[23] J. Fu, J. Patel, and B. Janssens, “Stride Directed Prefetching In Scalar Processors,” in [1992] Proceedings the 25th Annual International Symposium on Microarchitecture MICRO 25, Dec. 1992, pp. 102–110. [Online]. Available: https://ieeexplore.ieee.org/document/697004
[24] L. Geng, R. Lee, and X. Zhang, “LibRTS: A Spatial Indexing Library by Ray Tracing,” in Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’25. New York, NY, USA: Association for Computing Machinery, Feb. 2025, pp. 396–411. [Online]. Available: https://dl.acm.org/doi/10.1145/3710848.3710850
[25] J. Gunther, S. Popov, H.-P. Seidel, and P. Slusallek, “Realtime Ray Tracing on GPU with BVH-based Packet Traversal,” in 2007 IEEE Symposium on Interactive Ray Tracing, Sep. 2007, pp. 113–118. [Online]. Available: https://ieeexplore.ieee.org/document/4342598
[26] D. Ha, L. Liu, Y. H. Chou, S. Go, W. W. Ro, H.-W. Tseng, and T. M. Aamodt, “Generalizing Ray Tracing Accelerators for Tree Traversals on GPUs,” in Proceedings of the 57th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’24. New York, NY, USA: Association for Computing Machinery, Nov. 2024.
[27]

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers,

N. Jouppi, “Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers,” in

work page
[28]

The 17th Annual International Symposium on Computer Architecture, May 1990, pp

Proceedings. The 17th Annual International Symposium on Computer Architecture, May 1990, pp. 364–373. [Online]. Available: https://ieeexplore.ieee.org/document/134547

work page 1990
[29]

Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling,

M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), May 2020, pp. 473–486. [Online]. Available: https://ieeexplore.ieee.org/document/9138922

work page arXiv 2020
[30]

Many- Thread Aware Prefetching Mechanisms for GPGPU Applications,

J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc, “Many- Thread Aware Prefetching Mechanisms for GPGPU Applications,” in2010 43rd Annual IEEE/ACM International Symposium on 13 Microarchitecture, Dec. 2010, pp. 213–224, iSSN: 2379-3155. [Online]. Available: https://ieeexplore.ieee.org/document/5695538

work page arXiv 2010
[31]

GPUWattch: enabling energy optimizations in GPGPUs,

J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V . J. Reddi, “GPUWattch: enabling energy optimizations in GPGPUs,”ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 487–498, Jun. 2013. [Online]. Available: https://doi.org/10.1145/2508148.2485964

work page doi:10.1145/2508148.2485964 2013
[32]

Intersection Prediction for Accelerated GPU Ray Tracing,

L. Liu, W. Chang, F. Demoullin, Y . H. Chou, M. Saed, D. Pankratz, T. Nowicki, and T. M. Aamodt, “Intersection Prediction for Accelerated GPU Ray Tracing,” inMICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY , USA: Association for Computing Machinery, Oct. 2021, pp. 709–723. [Online]. Available: http...

work page doi:10.1145/3466752.3480097 2021
[33]

LumiBench: A Benchmark Suite for Hardware Ray Tracing,

L. Liu, M. Saed, Y . H. Chou, D. Grigoryan, T. Nowicki, and T. M. Aamodt, “LumiBench: A Benchmark Suite for Hardware Ray Tracing,” in2023 IEEE International Symposium on Workload Characterization (IISWC), Oct. 2023, pp. 1–14, iSSN: 2835-2238. [Online]. Available: https://ieeexplore.ieee.org/document/10289559

work page arXiv 2023
[34]

Thread-Aware Adaptive Prefetcher on Multicore Systems: Improving the Performance for Multithreaded Workloads,

P. Liu, J. Yu, and M. C. Huang, “Thread-Aware Adaptive Prefetcher on Multicore Systems: Improving the Performance for Multithreaded Workloads,”ACM Trans. Archit. Code Optim., vol. 13, no. 1, pp. 13:1– 13:25, Mar. 2016. [Online]. Available: https://doi.org/10.1145/2890505

work page doi:10.1145/2890505 2016
[35]

An effective GPU implementation of breadth-first search,

L. Luo, M. Wong, and W.-m. Hwu, “An effective GPU implementation of breadth-first search,” inProceedings of the 47th Design Automation Conference, ser. DAC ’10. New York, NY , USA: Association for Computing Machinery, Jun. 2010, pp. 52–55. [Online]. Available: https://dl.acm.org/doi/10.1145/1837274.1837289

work page doi:10.1145/1837274.1837289 2010
[36]

Arkade: k-Nearest Neighbor Search With Non-Euclidean Distances using GPU Ray Tracing,

D. K. Mandarapu, V . Nagarajan, A. Pelenitsyn, and M. Kulkarni, “Arkade: k-Nearest Neighbor Search With Non-Euclidean Distances using GPU Ray Tracing,” inProceedings of the 38th ACM International Conference on Supercomputing, ser. ICS ’24. New York, NY , USA: Association for Computing Machinery, Jun. 2024, pp. 14–25. [Online]. Available: https://dl.acm.or...

work page doi:10.1145/3650200.3656601 2024
[37]

Scalable GPU graph traversal,

D. Merrill, M. Garland, and A. Grimshaw, “Scalable GPU graph traversal,”SIGPLAN Not., vol. 47, no. 8, pp. 117–128, Feb. 2012. [Online]. Available: https://doi.org/10.1145/2370036.2145832

work page doi:10.1145/2370036.2145832 2012
[38]

Data Cache Prefetching Using a Global History Buffer,

K. Nesbit and J. Smith, “Data Cache Prefetching Using a Global History Buffer,” in10th International Symposium on High Performance Computer Architecture (HPCA’04), Feb. 2004, pp. 96–96, iSSN: 1530-

work page 2004
[39]

Available: https://ieeexplore.ieee.org/document/1410068

[Online]. Available: https://ieeexplore.ieee.org/document/1410068

work page arXiv
[40]

Node pre-fetching ar- chitecture for real-time ray tracing,

J.-s. Park, W.-c. Park, J.-H. Nah, and T.-d. Han, “Node pre-fetching ar- chitecture for real-time ray tracing,”IEICE Electronics Express, vol. 10, no. 14, pp. 20 130 468–20 130 468, 2013

work page 2013
[41]

Vulkan- Sim: A GPU Architecture Simulator for Ray Tracing,

M. Saed, Y . H. Chou, L. Liu, T. Nowicki, and T. M. Aamodt, “Vulkan- Sim: A GPU Architecture Simulator for Ray Tracing,” in2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct. 2022, pp. 263–281. [Online]. Available: https://ieeexplore.ieee.org/ document/9923844/citations?tabFilter=papers#citations

work page arXiv 2022
[42]

RayN: Ray Tracing Acceleration with Near-memory Computing,

M. Saed, P. J. Nair, and T. M. Aamodt, “RayN: Ray Tracing Acceleration with Near-memory Computing,” inProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’25. New York, NY , USA: Association for Computing Machinery, Oct. 2025, pp. 277–291. [Online]. Available: https: //dl.acm.org/doi/10.1145/3725843.3756067

work page doi:10.1145/3725843.3756067 2025
[43]

FreePDK: An Open-Source Variation-Aware Design Kit,

J. E. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. R. Davis, P. D. Franzon, M. Bucher, S. Basavarajaiah, J. Oh, and R. Jenkal, “FreePDK: An Open-Source Variation-Aware Design Kit,” in 2007 IEEE International Conference on Microelectronic Systems Education (MSE’07), Jun. 2007, pp. 173–174. [Online]. Available: https://ieeexplore.ieee.org/document/4231502

work page arXiv 2007
[44]

Home | Vulkan | Cross platform 3D Graphics,

Vulkan, “Home | Vulkan | Cross platform 3D Graphics,” May 2024. [Online]. Available: https://vulkan.org/

work page 2024
[45]

IMP: indirect memory prefetcher,

X. Yu, C. J. Hughes, N. Satish, and S. Devadas, “IMP: indirect memory prefetcher,” in Proceedings of the 48th International Symposium on Microarchitecture, ser. MICRO-48. New York, NY, USA: Association for Computing Machinery, Dec. 2015, pp. 178–190. [Online]. Available: https://doi.org/10.1145/2830772.2830807

work page doi:10.1145/2830772.2830807 2015
[46]

RTSpMSpM: Harnessing Ray Tracing for Efficient Sparse Matrix Computations,

H. Zhang, Y. Zhang, and H.-W. Tseng, “RTSpMSpM: Harnessing Ray Tracing for Efficient Sparse Matrix Computations,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, ser. ISCA ’25. New York, NY, USA: Association for Computing Machinery, Jun. 2025, pp. 359–373. [Online]. Available: https://dl.acm.org/doi/10.1145/3695053.3731072

work page doi:10.1145/3695053.3731072 2025
