Accelerating 3D Gaussian Splatting using Tensor Cores

Bo Yuan; Sheng Li; Xulong Tang; Yang Sui; Yue Dai; Yue Wu; Zhuoran Song

arxiv: 2605.17855 · v1 · pith:NQMLAYHHnew · submitted 2026-05-18 · 💻 cs.GR

Pith reviewed 2026-05-20 00:41 UTC · model grok-4.3

keywords 3D Gaussian Splatting · Tensor Cores · rasterization · GPU acceleration · neural rendering · real-time rendering · low-precision computation

The pith

TensorGS accelerates 3D Gaussian Splatting rendering 1.65 times by mapping rasterization to Tensor Core matrix operations and grouping Gaussians across tiles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that the compute-heavy rasterization stage in 3D Gaussian Splatting can run much faster on current GPUs. It converts the scattered per-pixel Gaussian contributions into dense matrix multiplications that Tensor Cores can execute efficiently and adds cross-tile grouping so that the same Gaussian data is reused by neighboring screen regions. This change yields a measured 1.65 times speedup for the full rendering pipeline while the final images stay visually the same. A reader would care because 3D Gaussian Splatting is already a leading method for real-time neural rendering, yet its speed has limited its use in latency-critical settings.

Core claim

TensorGS tensorizes the dominant rasterization computation into Tensor-Core-compatible matrix operations and introduces cross-tile grouping to improve Gaussian reuse, amortize overhead, and increase Tensor Core utilization. Experimental results show that TensorGS improves end-to-end rendering performance by 1.65× while preserving image quality.

What carries the argument

Tensorization of per-Gaussian rasterization into matrix multiplications together with cross-tile grouping that reuses Gaussian data across neighboring screen tiles.
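
To make the matrix reformulation concrete, here is a minimal sketch in plain Python. It is not the authors' kernel. It assumes the standard 3DGS exponent, power = -0.5(a·dx² + c·dy²) - b·dx·dy for a conic (a, b, c), and shows that expanding it in pixel monomials turns the evaluation for N Gaussians and M pixels into one (N×6)·(6×M) matrix product. The function names and the six-term layout are illustrative assumptions; the abstract does not state the paper's actual matrix layout.

```python
def gaussian_row(mx, my, a, b, c):
    """Per-Gaussian coefficients for the monomials [1, x, y, x*x, x*y, y*y].

    (mx, my) is the projected center and (a, b, c) the conic. Hypothetical layout.
    """
    return [
        -0.5 * a * mx * mx - b * mx * my - 0.5 * c * my * my,  # constant term
        a * mx + b * my,  # coefficient of x
        b * mx + c * my,  # coefficient of y
        -0.5 * a,         # coefficient of x*x
        -b,               # coefficient of x*y
        -0.5 * c,         # coefficient of y*y
    ]


def pixel_col(x, y):
    """Per-pixel monomial vector matching gaussian_row."""
    return [1.0, x, y, x * x, x * y, y * y]


def power_scalar(mx, my, a, b, c, x, y):
    """Reference per-pixel scalar evaluation of the exponent."""
    dx, dy = mx - x, my - y
    return -0.5 * (a * dx * dx + c * dy * dy) - b * dx * dy


def power_matmul(gaussians, pixels):
    """Evaluate every (Gaussian, pixel) exponent as a dense matrix product."""
    G = [gaussian_row(*g) for g in gaussians]  # N x 6
    P = [pixel_col(*p) for p in pixels]        # M x 6, used as the 6 x M operand
    return [[sum(gk * pk for gk, pk in zip(g, p)) for p in P] for g in G]


gaussians = [(3.0, 4.0, 0.5, 0.1, 0.8), (10.0, 2.0, 0.3, -0.05, 0.4)]
pixels = [(0.0, 0.0), (3.0, 4.0), (7.5, 1.5)]
powers = power_matmul(gaussians, pixels)  # powers[i][j]: Gaussian i at pixel j
```

One caveat of the expanded form: the individual terms grow with the square of the coordinates and largely cancel, so in low precision it is more sensitive to rounding than the scalar form. A practical kernel would presumably work in tile-local coordinates to keep the terms small.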

If this is right

Tensor Cores that previously sat idle during 3D Gaussian Splatting can now contribute to the dominant compute stage.
Repeated loading of the same Gaussian attributes drops because neighboring tiles share grouped work.
End-to-end latency falls enough to open 3D Gaussian Splatting for more interactive or embedded applications.
The same FP16 path keeps visual fidelity, so no separate high-precision fallback is required.
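
The reuse point in the list above can be sketched by counting how often a Gaussian has to be loaded under per-tile execution versus grouped execution. This is an illustration only: it assumes 16×16-pixel tiles as in the original 3DGS rasterizer, axis-aligned bounding boxes as the overlap test, and the 2×2 tile group named in the Figure 3 caption. The helper names are hypothetical.

```python
def tiles_covered(x0, y0, x1, y1, tile=16):
    """Set of (tx, ty) tiles touched by an inclusive pixel-space bounding box."""
    return {
        (tx, ty)
        for tx in range(int(x0) // tile, int(x1) // tile + 1)
        for ty in range(int(y0) // tile, int(y1) // tile + 1)
    }


def count_loads(boxes, tile=16, group=2):
    """Return (loads with one load per tile, loads with one load per tile group)."""
    per_tile = 0
    per_group = 0
    for box in boxes:
        tiles = tiles_covered(*box, tile=tile)
        per_tile += len(tiles)
        # Tiles that fall in the same group x group block share a single load.
        per_group += len({(tx // group, ty // group) for tx, ty in tiles})
    return per_tile, per_group


# Three made-up Gaussian footprints (x0, y0, x1, y1) in pixels.
boxes = [(4, 4, 28, 28), (30, 10, 70, 20), (100, 100, 105, 105)]
per_tile, per_group = count_loads(boxes)
print(per_tile, per_group)  # 13 5
```

In this toy input the first footprint covers four tiles inside one group, so grouping turns four loads into one. The third footprint sits in a single tile and gains nothing, which is why the measured reduction depends on the scene.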

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same matrix-reformulation pattern might apply to other graphics workloads whose per-pixel work is currently scattered.
Cross-tile grouping could be tested on different tile sizes or on non-square screen partitions to find the reuse sweet spot.
If the matrix mapping proves stable, similar low-precision hardware units on future chips could be targeted without rewriting the core algorithm.

Load-bearing premise

That 3D Gaussian Splatting rendering stays acceptable when run in 16-bit floating point and that the irregular per-pixel calculations can be reorganized into dense regular matrix workloads without large accuracy loss or extra data movement cost.
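
A quick way to get a feel for the FP16 half of this premise is to round a single alpha evaluation to half precision and compare it with double precision. The sketch below uses the standard library's `struct` format `e` (IEEE 754 binary16). It rounds every intermediate to FP16, which is a pessimistic assumption since Tensor Cores can accumulate in FP32, and the opacity and exponent values are made up.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def alpha_fp64(opacity, power):
    """Reference alpha = opacity * exp(power) in double precision."""
    return opacity * math.exp(power)


def alpha_fp16(opacity, power):
    """Same computation with inputs and each intermediate rounded to FP16."""
    o, p = to_fp16(opacity), to_fp16(power)
    e = to_fp16(math.exp(p))
    return to_fp16(o * e)


ref = alpha_fp64(0.8, -1.2345)
low = alpha_fp16(0.8, -1.2345)
rel_err = abs(low - ref) / ref  # relative error of the FP16 path
```

A single evaluation like this says nothing about accumulated error over many blended Gaussians, which is the part the premise actually depends on.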

What would settle it

Render the same scenes side by side with the original CUDA implementation and with TensorGS. The claim fails if PSNR or SSIM between the two falls below an acceptable threshold, or if frame-rate measurements show no net speedup.
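
The image-quality half of that check can be sketched with PSNR computed over flattened pixel values in [0, 1]. The maximum absolute error is included because an aggregate metric can hide a few badly wrong pixels. The pixel values below are made up, and SSIM is left out because it needs windowed statistics.

```python
import math


def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(test) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def max_abs_error(reference, test):
    """Largest per-pixel deviation, which an averaged metric can mask."""
    return max(abs(r - t) for r, t in zip(reference, test))


baseline = [0.10, 0.50, 0.90, 0.30]   # e.g. FP32 CUDA output (made-up values)
candidate = [0.10, 0.51, 0.90, 0.29]  # e.g. FP16 tensorized output (made-up values)
quality_db = psnr(baseline, candidate)
worst = max_abs_error(baseline, candidate)
```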

Figures

Figures reproduced from arXiv: 2605.17855 by Bo Yuan, Sheng Li, Xulong Tang, Yang Sui, Yue Dai, Yue Wu, Zhuoran Song.

**Figure 1.** (a) Averaged end-to-end time breakdown of the 3DGS rendering pipeline across six representative scenes. (b) Averaged time breakdown within the rasterization stage.

**Figure 2.** Overview of the original 3DGS pipeline and our proposed TensorGS.

**Figure 3.** Gaussian loading reduction within the default 2×2 tile group across six scenes.

**Figure 4.** Speedup over the original 3DGS pipeline on six representative scenes. Plot labels recovered from extraction: Speedup; gsplat, AdR-Gaussian, FlashGS, GSCore; Original + TensorGS.

**Figure 5.** Speedup (averaged over six scenes) when integrating TensorGS into the four representative 3DGS optimization methods.

Original abstract

3D Gaussian Splatting (3DGS) has become a leading technique for real-time neural rendering and 3D scene reconstruction, but its rendering cost remains too high for many latency-sensitive scenarios. In particular, the rasterization stage in 3DGS dominates end-to-end rendering time, during which the renderer repeatedly evaluates each Gaussian's contribution to each covered pixel, making this stage compute-bound. At the same time, modern GPUs provide high-throughput Tensor Cores for low-precision matrix operations, yet existing 3DGS systems execute rasterization entirely on CUDA cores and leave Tensor Cores idle. We find that 3DGS rendering can be executed in FP16 with negligible quality degradation, suggesting a promising opportunity for Tensor Core acceleration. However, exploiting Tensor Cores for 3DGS is non-trivial because rasterization does not naturally match their execution model. Existing 3DGS rasterization is expressed as irregular per-pixel scalar operations, whereas Tensor Cores require dense, regular, and reuse-rich matrix workloads. Moreover, conventional tile-by-tile execution fails to exploit Gaussian reuse across neighboring tiles, resulting in repeated data loading and thus high data movement overhead. To this end, we present TensorGS, a 3DGS acceleration framework using Tensor Cores. TensorGS tensorizes the dominant rasterization computation into Tensor-Core-compatible matrix operations and introduces cross-tile grouping to improve Gaussian reuse, amortize overhead, and increase Tensor Core utilization. Experimental results show that TensorGS improves end-to-end rendering performance by 1.65× while preserving image quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TensorGS maps 3DGS rasterization to Tensor Cores with cross-tile grouping for a measured 1.65× speedup, but the reformulation of irregular blending steps needs tighter validation.

The key takeaway is that this work shows how to repurpose Tensor Cores for the dominant rasterization pass in 3D Gaussian Splatting by recasting per-Gaussian contributions as matrix operations and adding cross-tile grouping to cut redundant loads. That combination delivers the reported 1.65× end-to-end gain while the authors claim image metrics stay intact. It is a concrete hardware-aware tweak rather than a new algorithm for the splatting itself. They correctly note that existing CUDA-only implementations leave Tensor Cores idle and that FP16 appears tolerable for this workload, which opens a practical path for latency-sensitive uses like VR. The cross-tile grouping is a sensible addition to improve reuse and utilization.

The main soft spot is the translation of depth-sorted alpha blending and per-pixel weighting into dense matrix form. Any padding, reordering, or approximation required to fit the irregular operations can introduce small divergences that standard PSNR and SSIM might miss, especially in dense overlap regions. The abstract gives limited experimental detail on baselines, scenes, and how FP16 conversion was checked, so the quality claim rests on fairly thin evidence at this stage. The stress-test concern about ordering violations is reasonable to raise until the full derivation and per-pixel error bounds are shown.

This paper is aimed at graphics engineers and hardware optimizers who already run 3DGS and want to squeeze more performance from current GPUs. A reader focused on real-time rendering pipelines would find the implementation approach useful even if they end up adapting rather than copying it. It is worth sending to peer review because the performance result is externally measurable and the hardware mapping is new for this technique, though the authors should expect questions on the exact matrix reformulation and stronger quality diagnostics.
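
The ordering concern in the letter can be illustrated with a toy front-to-back compositor for a single pixel and a single color channel. It follows the usual 3DGS blending rule (color accumulates as transmittance × alpha × color, and transmittance is multiplied by 1 - alpha), but omits the reference rasterizer's small-alpha skip and early termination. The sample values are made up.

```python
def composite(samples):
    """Front-to-back alpha blending of (alpha, color) pairs for one pixel."""
    color = 0.0
    transmittance = 1.0
    for alpha, c in samples:
        color += transmittance * alpha * c  # contribution seen through what is in front
        transmittance *= 1.0 - alpha        # light remaining for samples behind
    return color


near_to_far = [(0.6, 1.0), (0.5, 0.2), (0.9, 0.7)]
# Swap the two nearest samples to mimic a depth-ordering violation.
swapped = [near_to_far[1], near_to_far[0], near_to_far[2]]

correct = composite(near_to_far)  # about 0.766
wrong = composite(swapped)        # about 0.526
```

Swapping just two samples moves this pixel's value by roughly 0.24, which is why any reordering introduced by a matrix formulation deserves a per-pixel check rather than only an averaged metric.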

Referee Report

3 major / 2 minor

Summary. The paper introduces TensorGS, a framework to accelerate 3D Gaussian Splatting (3DGS) rendering by exploiting GPU Tensor Cores. It tensorizes the dominant rasterization stage—previously expressed as irregular per-pixel scalar operations—into dense, regular matrix operations compatible with Tensor Cores, and adds cross-tile grouping to improve Gaussian reuse across neighboring tiles and reduce data movement. The central claim is that this yields a 1.65× end-to-end rendering speedup while preserving image quality, based on the observation that 3DGS can run in FP16 with negligible degradation.

Significance. If the performance and quality claims hold under rigorous validation, the work would be significant for real-time neural rendering: it demonstrates a practical way to utilize otherwise-idle Tensor Cores for a compute-bound stage in a leading technique, potentially lowering latency in applications such as AR/VR and robotics. The cross-tile grouping idea addresses a genuine data-reuse opportunity not exploited by conventional tile-based 3DGS pipelines. The manuscript does not yet provide machine-checked proofs or fully reproducible artifacts, but the hardware-aware reformulation itself is a concrete, falsifiable contribution.

major comments (3)

[Abstract] Abstract: The reported 1.65× end-to-end speedup and “preserved image quality” rest on limited evidence; the abstract provides no baseline implementation details, number or identity of test scenes, error bars, or quantitative validation that FP16 conversion and matrix reformulation incur negligible PSNR/SSIM loss. This directly undermines assessment of the central performance claim.
[Tensorization of Rasterization] Tensorization section (around the description of matrix reformulation for Gaussian weights and alpha blending): Re-expressing per-pixel depth-sorted accumulation and blending as batched dense matrix multiplies necessarily involves padding, reordering, or approximation; the manuscript should demonstrate that the resulting per-pixel results match the original CUDA implementation up to FP16 rounding, or bound any divergence in high-overlap regions, because ordering violations would invalidate the quality-preservation assertion even if aggregate metrics pass.
[Cross-Tile Grouping] Cross-tile grouping description: While the reuse optimization is conceptually sound, the paper must quantify the additional data-movement and synchronization overhead introduced by grouping across tiles and show that it does not offset the Tensor Core gains on the target hardware; without these measurements the net 1.65× figure cannot be confidently attributed to the proposed techniques.

minor comments (2)

[Abstract] The abstract introduces the acronym “TensorGS” without an explicit definition on first use; a parenthetical expansion would improve readability.
[Experimental Results] Figure captions and axis labels in the experimental section would benefit from explicit mention of the hardware platform (e.g., specific GPU model and Tensor Core generation) to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have prepared point-by-point responses to the major comments and will revise the paper to address the concerns by adding requested details, validations, and measurements.

Referee: [Abstract] Abstract: The reported 1.65× end-to-end speedup and “preserved image quality” rest on limited evidence; the abstract provides no baseline implementation details, number or identity of test scenes, error bars, or quantitative validation that FP16 conversion and matrix reformulation incur negligible PSNR/SSIM loss. This directly undermines assessment of the central performance claim.

Authors: We agree that the abstract would benefit from additional context to support the central claims. In the revised version, we will expand the abstract to reference the original 3DGS CUDA implementation as the baseline, specify the test scenes drawn from standard datasets such as Mip-NeRF 360, and note that quantitative PSNR/SSIM results with error bars appear in the experiments section. This will provide clearer substantiation for the reported speedup and quality preservation without altering the abstract's length substantially.

revision: yes

Referee: [Tensorization of Rasterization] Tensorization section (around the description of matrix reformulation for Gaussian weights and alpha blending): Re-expressing per-pixel depth-sorted accumulation and blending as batched dense matrix multiplies necessarily involves padding, reordering, or approximation; the manuscript should demonstrate that the resulting per-pixel results match the original CUDA implementation up to FP16 rounding, or bound any divergence in high-overlap regions, because ordering violations would invalidate the quality-preservation assertion even if aggregate metrics pass.

Authors: We acknowledge the value of finer-grained validation beyond aggregate metrics. The current manuscript supports quality preservation primarily through overall PSNR and SSIM comparisons showing negligible degradation for the FP16 tensorized version versus the FP32 CUDA baseline. To directly address potential ordering or divergence issues, we will add per-pixel difference analysis and visualizations in the revised manuscript, along with explicit bounds on divergence in high-overlap regions. Depth sorting will be shown to be preserved through a dedicated preprocessing step before the matrix operations, confirming that results align within FP16 rounding tolerances.

revision: yes

Referee: [Cross-Tile Grouping] Cross-tile grouping description: While the reuse optimization is conceptually sound, the paper must quantify the additional data-movement and synchronization overhead introduced by grouping across tiles and show that it does not offset the Tensor Core gains on the target hardware; without these measurements the net 1.65× figure cannot be confidently attributed to the proposed techniques.

Authors: This is a fair request for a more complete performance attribution. While the manuscript reports the overall 1.65× end-to-end speedup measured on the target hardware, it does not separately profile the data-movement and synchronization overheads specific to cross-tile grouping. We will revise the experimental evaluation to include these measurements, showing that the overhead is amortized by improved Gaussian reuse and higher Tensor Core utilization, thereby confirming that the net speedup is attributable to the proposed techniques.

revision: yes

Circularity Check

0 steps flagged

No circularity: measured speedups from implementation, not self-referential derivation

The paper presents an engineering acceleration framework for 3DGS rasterization by reformulating it as Tensor-Core matrix operations plus cross-tile grouping. All reported results (1.65× end-to-end speedup, preserved PSNR/SSIM) are obtained from direct hardware benchmarks against baseline CUDA implementations. No load-bearing step reduces to a fitted parameter, self-citation chain, or equation that is definitionally equivalent to its inputs. The central claims rest on external GPU performance measurements and standard image-quality metrics rather than any internal derivation that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly relies on standard assumptions about GPU Tensor Core availability and FP16 numerical tolerance in graphics workloads.

[1] [1]

Jack Choquette, Wishwesh Gandhi, Olivier Giroux, Nick Stam, and Ronny Krashinsky. 2021. Nvidia a100 tensor core gpu: Performance and innovation.IEEE Micro41, 2 (2021), 29–35

work page 2021

[2] [2]

Guofeng Feng, Siyan Chen, Rong Fu, Zimu Liao, Yi Wang, Tao Liu, Boni Hu, Lining Xu, Zhilin Pei, Hengjie Li, Xiuhong Li, Ninghui Sun, Xingcheng Zhang, and Bo Dai. 2025. Flashgs: Efficient 3d gaussian splatting for large-scale and high-resolution rendering. InProceedings of the Computer Vision and Pattern Recognition Conference. 26652– 26662

work page 2025

[3] [3]

Houshu He, Naifeng Jing, Li Jiang, Xiaoyao Liang, and Zhuoran Song

work page

[4] [4]

InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1

AGS: A ccelerating 3D G aussian Splatting S LAM via CODEC- Assisted Frame Covisibility Detection. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 20–34

work page

[5] [5]

Houshu He, Gang Li, Fangxin Liu, Li Jiang, Xiaoyao Liang, and Zhuo- ran Song. 2025. Gsarch: Breaking memory barriers in 3d gaussian splatting training via architectural support. In2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 366–379

work page 2025

[6] [6]

Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. 2018. Deep blending for free- viewpoint image-based rendering.ACM Transactions on Graphics (ToG)37, 6 (2018), 1–15

work page 2018

[7] [7]

Lukas Höllein, Aljaž Božič, Michael Zollhöfer, and Matthias Nießner

work page

[8] [8]

InProceedings of the IEEE/CVF International Conference on Computer Vision

3dgs-lm: Faster gaussian-splatting optimization with levenberg- marquardt. InProceedings of the IEEE/CVF International Conference on Computer Vision. 26740–26750

work page

[9] [9]

Xiaotong Huang, He Zhu, Tianrui Ma, Yuxiang Xiong, Fangxin Liu, Zhezhi He, Yiming Gan, Zihan Liu, Jingwen Leng, Yu Feng, and Minyi Guo. 2026. SPLATONIC: Architectural Support for 3D Gaussian Splat- ting SLAM via Sparse Processing. In2026 IEEE International Sympo- sium on High Performance Computer Architecture (HPCA). IEEE, 1–14

work page 2026

[10] [10]

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering.ACM Trans. Graph.42, 4, Article 139 (July 2023), 14 pages

work page 2023

[11] [11]

Hyunjeong Kim and In-Kwon Lee. 2024. Is 3dgs useful?: Comparing the effectiveness of recent reconstruction methods in vr. In2024 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 71–80

work page 2024

[12] [12]

Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. 2017. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG)36, 4 (2017), 1–13

work page 2017

[13] [13]

Donghyun Lee, Dawoon Jeong, Jae W Lee, and Hongil Yoon. 2026. GS- Scale: Unlocking Large-Scale 3D Gaussian Splatting Training via Host Offloading. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 860–875

work page 2026

[14] [14]

Junseo Lee, Seokwon Lee, Jungi Lee, Junyong Park, and Jaewoong Sim. 2024. Gscore: Efficient radiance field rendering via architectural support for 3d gaussian splatting. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 497–511

work page 2024

[15] [15]

Haomin Li, Yue Liang, Fangxin Liu, Bowen Zhu, Zongwu Wang, Yu Feng, Liqiang Lu, Li Jiang, and Haibing Guan. 2026. ORANGE: Ex- ploring Ockham’s Razor for Neural Rendering by Accelerating 3DGS on NPUs with GEMM-Friendly Blending and Balanced Workloads. In2026 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1–15

work page 2026

[16] [16]

Leshu Li, Jiayin Qin, Jie Peng, Zishen Wan, Huaizhi Qu, Ye Han, Pingqing Zheng, Hongsen Zhang, Yu Cao, Tianlong Chen, and Yang (Katie) Zhao. 2025. RTGS: Real-Time 3D Gaussian Splatting SLAM via Multi-Level Redundancy Reduction. InProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture. 1838– 1851

[17]

Rongji Liao, Yuan Zhang, Wei Zhang, Lingjun Pu, Yu Guan, Yunpeng Jing, Tao Lin, and Jinyao Yan. 2025. 3DGS-enabled High-fidelity Low-cost Immersive Static 3D Video Streaming. IEEE Journal on Selected Areas in Communications (2025).

[18]

Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. 2020. Neural sparse voxel fields. Advances in Neural Information Processing Systems 33 (2020), 15651–15663.

[19]

Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. 2018. Nvidia tensor core programmability, performance & precision. In 2018 IEEE international parallel and distributed processing symposium workshops (IPDPSW). IEEE, 522–531.

[20]

Vitor Pereira Matias, Daniel Perazzo, Vinicius Silva, Alberto Raposo, Luiz Velho, Afonso Paiva, and Tiago Novello. 2025. From volume rendering to 3d gaussian splatting: Theory and applications. In 2025 38th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). IEEE, 1–6.

[21]

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99–106.

[22]

Seock-Hwan Noh, Banseok Shin, Jeik Choi, Seungpyo Lee, Jaeha Kung, and Yeseong Kim. 2025. FlexNeRFer: A Multi-Dataflow, Adaptive Sparsity-Aware Accelerator for On-Device NeRF Rendering. In Proceedings of the 52nd Annual International Symposium on Computer Architecture. 1894–1909.

[23]

Changhun Oh, Seongryong Oh, Jinwoo Hwang, Yoonsung Kim, Hardik Sharma, and Jongse Park. 2026. Neo: Real-Time On-Device 3D Gaussian Splatting with Reuse-and-Update Sorting Acceleration. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 1268–1284.

[24]

Minnan Pei, Gang Li, Junwen Si, Zeyu Zhu, Zitao Mo, Peisong Wang, Zhuoran Song, Xiaoyao Liang, and Jian Cheng. 2025. GCC: A 3DGS Inference Architecture with Gaussian-Wise and Cross-Stage Conditional Processing. In Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture. 1824–1837.

[25]

Shi Qiu, Binzhu Xie, Qixuan Liu, and Pheng-Ann Heng. 2025. Advancing extended reality with 3d gaussian splatting: Innovations and prospects. In 2025 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR). IEEE, 203–208.

[26]

Santosh Reddy, H Abhiram, and KS Archish. 2025. A survey of 3D Gaussian splatting: optimization techniques, applications, and AI-driven advancements. In 2025 International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE). IEEE, 1–6.

[27]

Hongyi Wang, Zhenhua Zhu, Tianchen Zhao, Yunfei Xiang, Zehao Wang, Jincheng Yu, Huazhong Yang, Yuan Xie, and Yu Wang. 2025. REACT3D: Real-time Edge Accelerator for Incremental Training in 3D Gaussian Splatting based SLAM Systems. In Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture. 1852–1866.

[28]

Xinzhe Wang, Ran Yi, and Lizhuang Ma. 2024. Adr-gaussian: Accelerating gaussian splatting with adaptive radius. In SIGGRAPH Asia 2024 Conference Papers. 1–10.

[29]

Rui Wen, Zhifei Yue, Tianbo Liu, Xinkai Song, Jin Li, Di Huang, Jiaming Guo, Xing Hu, Zidong Du, Qi Guo, and Tianshi Chen. 2026. Cambricon-GS: An Accelerator for 3D Gaussian Splatting Training With Gaussian-Pixel Hybrid Parallelism. In 2026 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1–14.

[30]

Lizhou Wu, Haozhe Zhu, Siqi He, Jiapei Zheng, Chixiao Chen, and Xiaoyang Zeng. 2024. Gauspu: 3d gaussian splatting processor for real-time slam systems. In 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 1562–1573.

[31]

Yiwei Xu, Yifei Yu, Wentian Gan, Tengfei Wang, Zongqian Zhan, Hao Cheng, and Xin Wang. 2025. Gaussian on-the-fly splatting: A progressive framework for robust near real-time 3dgs optimization. IEEE Robotics and Automation Letters 11, 1 (2025), 426–433.

[32]

Vickie Ye, Ruilong Li, Justin Kerr, Matias Turkulainen, Brent Yi, Zhuoyang Pan, Otto Seiskari, Jianbo Ye, Jeffrey Hu, Matthew Tancik, and Angjoo Kanazawa. 2025. gsplat: An open-source library for Gaussian splatting. Journal of Machine Learning Research 26, 34 (2025), 1–17.

[33]

Zhifan Ye, Yonggan Fu, Jingqun Zhang, Leshu Li, Yongan Zhang, Sixu Li, Cheng Wan, Chenxi Wan, Chaojian Li, Sreemanth Prathipati, and Yingyan Celine Lin. 2025. Gaussian blending unit: An edge gpu plug-in for real-time gaussian-based rendering in ar/vr. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 353–365.

[34]

Hexu Zhao, Xiwen Min, Xiaoteng Liu, Moonjun Gong, Yiming Li, Ang Li, Saining Xie, Jinyang Li, and Aurojit Panda. 2026. Clm: Removing the gpu memory barrier for 3d gaussian splatting. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 377–393.

[35]

Zhiyu Zhou, Feng Hui, Xing Li, and Yu Liu. 2025. Visual Localization Using 3D Gaussian Splatting Representation for Mobile Robots With Geometric Feature Correspondences Synthesis. IEEE Transactions on Automation Science and Engineering (2025).

[36]

Siting Zhu, Guangming Wang, Xin Kong, Dezhi Kong, and Hesheng Wang. 2024. 3d gaussian splatting in robotics: A survey. arXiv preprint arXiv:2410.12262 (2024).
