TTP: A Hardware-Efficient Design for Precise Prefetching in Ray Tracing
Pith reviewed 2026-05-19 18:11 UTC · model grok-4.3
The pith
A prefetcher that monitors consecutive pops from ray tracing traversal stacks delivers 1.48x average speedup with negligible hardware overhead.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TTP prefetches nodes using the addresses already available on the hardware traversal stacks of each thread. For DFS-based traversal, prefetches are generated when nodes are being popped consecutively from the traversal stack, potentially corresponding to upward traversal through the tree. Evaluated on the cycle-level simulator Vulkan-sim 2.0, TTP achieves 1.48x speedup on average compared to the baseline with nearly negligible hardware overhead, 98.92 percent average L1 accuracy, and 31.54 percent coverage.
What carries the argument
Tree Traversal Prefetcher (TTP), which issues prefetch requests from addresses on the existing hardware traversal stack during consecutive pops in depth-first BVH traversal.
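The trigger described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the names `TraversalStack`, `last_op_was_pop`, and `dfs` are hypothetical, the choice to prefetch the single entry left on top of the stack is an assumption (the summary does not say which entries are prefetched or how many), and ray-box intersection tests are omitted so that every child is visited.

```python
from dataclasses import dataclass, field


@dataclass
class TraversalStack:
    """Toy per-thread traversal stack with a consecutive-pop prefetch trigger."""
    entries: list = field(default_factory=list)     # node addresses, top at the end
    last_op_was_pop: bool = False                   # assumed single bit of extra state
    prefetches: list = field(default_factory=list)  # log of issued prefetch addresses

    def push(self, addr: int) -> None:
        self.entries.append(addr)
        self.last_op_was_pop = False

    def pop(self) -> int:
        addr = self.entries.pop()
        # A pop that directly follows another pop is treated as upward
        # traversal. Assumption: prefetch the address now on top of the
        # stack, since it is the next node this thread will request.
        if self.last_op_was_pop and self.entries:
            self.prefetches.append(self.entries[-1])
        self.last_op_was_pop = True
        return addr


def dfs(tree: dict, root: int):
    """Depth-first traversal; returns (visit order, prefetched addresses)."""
    stack = TraversalStack()
    stack.push(root)
    order = []
    while stack.entries:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the first child is visited first.
        for child in reversed(tree.get(node, [])):
            stack.push(child)
    return order, stack.prefetches


if __name__ == "__main__":
    # Seven-node complete binary tree; keys are stand-ins for node addresses.
    order, prefetches = dfs({0: [1, 2], 1: [3, 4], 2: [5, 6]}, 0)
    print(order, prefetches)
```

On this seven-node tree the sketch visits 0, 1, 3, 4, 2, 5, 6 and issues one prefetch, for node 2, at the moment leaf 4 is popped right after leaf 3. The prefetch address is read from the stack itself, which is why no separate prediction table appears in the model.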
If this is right
- Ray tracing workloads experience fewer memory stalls during BVH traversal, increasing overall throughput.
- High prefetch accuracy keeps cache pollution low and avoids wasting bandwidth on unused data.
- The design integrates into existing RT cores because it reuses already-present stack hardware.
- Speedup scales with scene size and complexity where memory latency dominates computation.
Where Pith is reading between the lines
- Stack-based monitoring may extend to prefetching in other tree-traversal workloads such as spatial databases or physics simulations.
- Future hardware revisions could expose additional traversal state to enable even more targeted prefetch decisions.
- The approach suggests that domain-specific knowledge of internal hardware structures can outperform generic prefetch algorithms.
Load-bearing premise
The cycle-level simulator accurately reproduces the memory access patterns and traversal stack behavior of real ray tracing hardware, and consecutive stack pops reliably indicate useful upward traversal without causing cache pollution.
What would settle it
Measuring prefetch accuracy and overall speedup when the same TTP logic is placed in real ray tracing silicon and run on physical GPUs instead of a cycle-level simulator.
Original abstract
Ray tracing (RT) is a 3D graphics technique that offers highly realistic visuals. It is becoming prominent and accessible as GPU vendors have integrated dedicated ray tracing acceleration hardware. However, tracing millions of rays through 3D scenes consisting of high numbers of triangles in real time is challenging and requires expensive hardware. The main bottleneck in RT workloads is the expensive Bounding Volume Hierarchy (BVH) traversal task, which is a large tree structure that encodes the 3D scene. BVH traversal is a memory-bound problem, as the GPU threads spend most of their time reading tree node data from memory. In this work, we attack the memory latency bottleneck of ray tracing through prefetching. We propose a novel hardware prefetcher, named Tree Traversal Prefetcher (TTP), for ray tracing. The main idea is to leverage the existing tree traversal stack in the RT units for highly accurate prefetching. In particular, TTP prefetches nodes using the addresses already available on the hardware traversal stacks of each thread. For DFS (Depth-first search) based traversal, prefetches are generated when nodes are being popped consecutively from the traversal stack, potentially corresponding to upward traversal through the tree. We evaluate TTP on a cycle-level simulator, Vulkan-sim 2.0, and show that it achieves 1.48x speedup on average (up to 1.89x) compared to the baseline, with nearly negligible hardware overhead. TTP achieves 98.92% average L1 accuracy, which is the ratio of the prefetched blocks being actually referenced by demand loads. The coverage, computed as the ratio of L1 miss reduction over baseline L1 misses, is 31.54%, correlating well with the achieved speedup.
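The two metrics defined at the end of the abstract can be written out as a short sketch. The function and parameter names are hypothetical, and the counter values in the usage lines are made-up illustrations, not measurements from the paper.

```python
def l1_accuracy(prefetched_blocks_used: int, prefetched_blocks_total: int) -> float:
    """Fraction of prefetched blocks that are later referenced by demand loads."""
    if prefetched_blocks_total == 0:
        return 0.0  # assumption: report 0 when no prefetches were issued
    return prefetched_blocks_used / prefetched_blocks_total


def l1_coverage(baseline_misses: int, misses_with_prefetch: int) -> float:
    """L1 miss reduction relative to the baseline's L1 misses."""
    if baseline_misses == 0:
        return 0.0  # assumption: nothing to cover if the baseline never misses
    return (baseline_misses - misses_with_prefetch) / baseline_misses


if __name__ == "__main__":
    # Illustrative counters only.
    print(l1_accuracy(prefetched_blocks_used=90, prefetched_blocks_total=100))
    print(l1_coverage(baseline_misses=1000, misses_with_prefetch=700))
```

Read this way, accuracy measures how little of the prefetch traffic is wasted, while coverage measures how many baseline misses are removed. A prefetcher can score high on the first and modestly on the second, which is the combination the abstract reports.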
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Tree Traversal Prefetcher (TTP), a hardware prefetcher for ray tracing that leverages addresses already present on the traversal stacks of RT units. Prefetches are triggered on consecutive pops from the stack (assumed to mark useful upward DFS traversal). Evaluated in cycle-level simulation on Vulkan-sim 2.0, TTP reports 1.48× average speedup (max 1.89×), 98.92 % L1 accuracy, 31.54 % coverage, and negligible hardware overhead relative to a baseline without prefetching.
Significance. If the simulation results hold, the work demonstrates a low-cost, high-accuracy prefetching technique that exploits an existing RT hardware structure (the traversal stack) rather than adding new prediction tables. The reported correlation between coverage and speedup, together with the near-negligible area cost, would be a useful contribution to memory-latency mitigation in dedicated ray-tracing hardware.
major comments (2)
- [Evaluation] Evaluation section: all speedup, accuracy, and coverage numbers are obtained from Vulkan-sim 2.0 runs; the manuscript provides no cross-validation against real RT cores, alternative simulators, or sensitivity analysis to stack-pop timing and memory latency parameters. Because the prefetch logic is triggered specifically by consecutive stack pops, any mismatch in modeled stack behavior directly undermines the 1.48× speedup and 98.92 % accuracy claims.
- [Evaluation] The coverage metric (31.54 %) is defined as L1-miss reduction relative to baseline, yet the text does not quantify how much of the observed speedup is attributable to latency hiding versus other simulator effects (e.g., changes in thread scheduling). This link is load-bearing for the central performance claim.
minor comments (2)
- [Abstract] The abstract states 'nearly negligible hardware overhead' without quoting concrete area or power numbers; these figures should appear in the main text or a table.
- [Abstract] Benchmark scenes, triangle counts, and exact baseline configuration (cache sizes, memory latency, etc.) are not summarized in the abstract; a short table or sentence would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment on the evaluation methodology below, providing clarifications and committing to revisions where appropriate to strengthen the paper.
- Referee: [Evaluation] Evaluation section: all speedup, accuracy, and coverage numbers are obtained from Vulkan-sim 2.0 runs; the manuscript provides no cross-validation against real RT cores, alternative simulators, or sensitivity analysis to stack-pop timing and memory latency parameters. Because the prefetch logic is triggered specifically by consecutive stack pops, any mismatch in modeled stack behavior directly undermines the 1.48× speedup and 98.92 % accuracy claims.
  Authors: We agree that additional validation would be valuable. Cross-validation on real RT cores is not feasible in this work, because commercial ray-tracing hardware implementations and their detailed microarchitectural models are proprietary. Vulkan-sim 2.0 is a cycle-level simulator designed for Vulkan ray-tracing workloads and is commonly used in the literature for such studies. To address the concern about sensitivity to modeling assumptions, we will add a subsection to the revised manuscript presenting sensitivity analysis on memory latency parameters and on variations in the timing of consecutive stack pops.
  revision: partial
- Referee: [Evaluation] The coverage metric (31.54 %) is defined as L1-miss reduction relative to baseline, yet the text does not quantify how much of the observed speedup is attributable to latency hiding versus other simulator effects (e.g., changes in thread scheduling). This link is load-bearing for the central performance claim.
  Authors: We acknowledge that a more explicit breakdown would improve the clarity of the performance results. The reported correlation between coverage and speedup, combined with the very high L1 accuracy, indicates that the gains stem primarily from reduced memory latency. In the revision we will add execution-time breakdowns and memory-stall-cycle statistics to better quantify the contribution of latency hiding versus other simulator effects such as scheduling.
  revision: yes
- Direct validation against real commercial RT cores or alternative closed-source simulators, as we lack access to proprietary hardware models.
Circularity Check
No circularity: TTP design evaluated via independent simulation measurements
The paper proposes TTP as a hardware prefetcher that triggers on consecutive pops from the existing RT traversal stack to prefetch BVH nodes during upward DFS traversal. All reported results (1.48x average speedup, 98.92% L1 accuracy defined as prefetched blocks later referenced, 31.54% coverage as miss reduction ratio) are measured outcomes from separate cycle-level simulation runs on Vulkan-sim 2.0. These quantities are not algebraically or definitionally forced by the prefetch rule itself; they are empirical outputs. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes smuggled via prior work appear in the derivation. The design is self-contained; simulator fidelity is an external validity assumption, not an internal circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Ray tracing hardware performs DFS-based BVH traversal and maintains a per-thread traversal stack of node addresses.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "We evaluate TTP on a cycle-level simulator, Vulkan-sim 2.0, and show that it achieves 1.48x speedup"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.