{"paper":{"title":"TTP: A Hardware-Efficient Design for Precise Prefetching in Ray Tracing","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"A prefetcher that monitors consecutive pops from ray tracing traversal stacks delivers 1.48x average speedup with negligible hardware overhead.","cross_cats":[],"primary_cat":"cs.AR","authors_text":"Anshul Naithani, Huiyang Zhou, Yavuz Selim Tozlu","submitted_at":"2026-05-15T17:57:31Z","abstract_excerpt":"Ray tracing (RT) is a 3D graphics technique that offers highly realistic visuals. It is becoming prominent and accessible as GPU vendors have integrated dedicated ray tracing acceleration hardware. However, tracing millions of rays through 3D scenes consisting of high numbers of triangles in real time is challenging and requires expensive hardware. The main bottleneck in RT workloads is the expensive Bounding Volume Hierarchy (BVH) traversal task, which is a large tree structure that encodes the 3D scene. BVH traversal is a memory-bound problem, as the GPU threads spend most of their time read"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We propose a novel hardware prefetcher, named Tree Traversal Prefetcher (TTP), for ray tracing. ... TTP achieves 1.48x speedup on average (up to 1.89x) compared to the baseline, with nearly negligible hardware overhead. TTP achieves 98.92% average L1 accuracy, which is the ratio of the prefetched blocks being actually referenced by demand loads.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The cycle-level simulator Vulkan-sim 2.0 accurately reproduces the memory access patterns and traversal stack behavior of real ray tracing hardware, and that consecutive stack pops reliably indicate useful upward traversal for prefetching without causing cache pollution.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"TTP is a hardware prefetcher for ray tracing that leverages traversal stack addresses during DFS to prefetch BVH nodes, achieving 1.48x average speedup and 98.92% L1 accuracy in cycle-level simulations.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A prefetcher that monitors consecutive pops from ray tracing traversal stacks delivers 1.48x average speedup with negligible hardware overhead.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"de1302ceaae9101d443b12b2dce413ead090016401b92661316ea2b8ae354bcf"},"source":{"id":"2605.16253","kind":"arxiv","version":1},"verdict":{"id":"56561056-afe2-4609-8a12-fc5ffc16c4bc","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T18:11:46.288479Z","strongest_claim":"We propose a novel hardware prefetcher, named Tree Traversal Prefetcher (TTP), for ray tracing. ... TTP achieves 1.48x speedup on average (up to 1.89x) compared to the baseline, with nearly negligible hardware overhead. TTP achieves 98.92% average L1 accuracy, which is the ratio of the prefetched blocks being actually referenced by demand loads.","one_line_summary":"TTP is a hardware prefetcher for ray tracing that leverages traversal stack addresses during DFS to prefetch BVH nodes, achieving 1.48x average speedup and 98.92% L1 accuracy in cycle-level simulations.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The cycle-level simulator Vulkan-sim 2.0 accurately reproduces the memory access patterns and traversal stack behavior of real ray tracing hardware, and that consecutive stack pops reliably indicate useful upward traversal for prefetching without causing cache pollution.","pith_extraction_headline":"A prefetcher that monitors consecutive pops from ray tracing traversal stacks delivers 1.48x average speedup with negligible hardware overhead."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.16253/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"doi_title_agreement","ran_at":"2026-05-19T18:31:18.695492Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T18:21:10.404578Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"shingle_duplication","ran_at":"2026-05-19T17:49:42.172704Z","status":"skipped","version":"0.1.0","findings_count":0},{"name":"citation_quote_validity","ran_at":"2026-05-19T17:49:41.785024Z","status":"skipped","version":"0.1.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T17:33:23.077356Z","status":"skipped","version":"1.0.0","findings_count":0},{"name":"external_links","ran_at":"2026-05-19T17:31:23.733165Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T17:01:55.594408Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"cited_work_retraction","ran_at":"2026-05-19T16:51:56.161016Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"b00690d10550dd52a566dca981f49b5dfe59251a2fac75e5b8265a0f4c9f0e03"},"references":{"count":46,"sample":[{"doi":"","year":2023,"title":"Code repo for Treelet Prefetching For Ray Tracing (MICRO 2023)","work_id":"156a72e2-aa71-4997-ac43-ef11e1c85e01","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"DirectX Raytracing (DXR) Functional Spec","work_id":"07088007-f1fa-44a1-a0f0-a0eb7764bc88","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Intel Embree","work_id":"76c3a450-f00d-4cb3-a874-f6050ef9b16a","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Intel® Arc™ Graphics Developer Guide for Real-Time Ray Tracing in","work_id":"68032a90-4028-46e8-8c94-0bb7e963e362","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"NVIDIA ADA GPU ARCHITECTURE","work_id":"01be0015-0d13-4396-a23f-28a27abe9c5c","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":46,"snapshot_sha256":"076265b1df3e6d34ea9833e0a2dfb14b6fb388cb943cee4e2608bb8095fdf07d","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"1e79e494acdc80837bda816a58000e2132cbc49a4e4ba6abf58d1881dea0334d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}