GEMM-GS: Accelerating 3D Gaussian Splatting on Tensor Cores with GEMM-Compatible Blending
Pith reviewed 2026-05-13 20:49 UTC · model grok-4.3
The pith
Reformulating 3D Gaussian Splatting blending as matrix multiplications enables Tensor Core acceleration, with a reported 1.42× speedup over vanilla 3DGS.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By transforming the alpha blending of Gaussians into GEMM operations, the approach maps the blending stage of the 3DGS pipeline onto Tensor Cores. A three-stage double-buffered CUDA kernel overlaps computation with memory access to maximize throughput. Benchmarks show a 1.42× speedup over vanilla 3DGS and a further 1.47× speedup on average when the method is combined with existing acceleration approaches.
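Why overlapping computation with memory access pays off can be illustrated with a toy timing model. This is a minimal sketch, not the paper's kernel: the three per-batch stage costs are hypothetical numbers in arbitrary units, and the model assumes enough buffering that each stage can work on a different batch at the same time.

```python
def serial_time(stage_times, n_batches):
    """Total time when every batch runs all stages back to back."""
    return n_batches * sum(stage_times)


def pipelined_time(stage_times, n_batches):
    """Total time with ideal overlap between stages.

    The first batch passes through every stage once (pipeline fill);
    each later batch completes one bottleneck interval after the previous one.
    """
    return sum(stage_times) + (n_batches - 1) * max(stage_times)


# Hypothetical per-batch costs of three stages (arbitrary units).
stages = [2.0, 3.0, 1.0]
n_batches = 100

serial = serial_time(stages, n_batches)
pipelined = pipelined_time(stages, n_batches)
print(f"serial={serial}, pipelined={pipelined}, ratio={serial / pipelined:.2f}")
```

In this model the steady-state cost per batch is set by the slowest stage alone, so the benefit shrinks as one stage comes to dominate the others.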
What carries the argument
The GEMM-compatible blending transformation that recasts per-pixel Gaussian accumulation into matrix multiplication operations.
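One way such a recasting can work is sketched below. It is an assumption for illustration, not the formula from the paper: the exponent of a 2D Gaussian is a quadratic polynomial in the pixel coordinates, so it can be written as a dot product between six per-Gaussian coefficients and six per-pixel monomials. Stacking Gaussians as rows and pixels as columns then yields all exponents with one matrix multiplication. The names `gaussian_row`, `pixel_column`, and the sample values are hypothetical.

```python
def gaussian_row(mx, my, a, b, c):
    """Coefficients of the exponent as a polynomial in (x, y).

    (mx, my) is the projected mean; [[a, b], [b, c]] is the inverse 2D covariance.
    """
    return [
        -0.5 * a,                                              # x^2
        -b,                                                    # x*y
        -0.5 * c,                                              # y^2
        a * mx + b * my,                                       # x
        b * mx + c * my,                                       # y
        -0.5 * a * mx * mx - b * mx * my - 0.5 * c * my * my,  # constant
    ]


def pixel_column(x, y):
    """Monomials that pair with the coefficients above."""
    return [x * x, x * y, y * y, x, y, 1.0]


def matmul(rows, cols):
    """Plain (G x 6) times (6 x P) product; `cols` holds the right matrix column by column."""
    return [[sum(r * c for r, c in zip(row, col)) for col in cols] for row in rows]


def exponent_direct(mx, my, a, b, c, x, y):
    """Reference: evaluate the quadratic form per Gaussian and per pixel."""
    dx, dy = x - mx, y - my
    return -0.5 * (a * dx * dx + c * dy * dy) - b * dx * dy


# Hypothetical Gaussians (mx, my, a, b, c) and a 4x4 pixel tile.
gaussians = [(3.2, 4.1, 0.8, 0.1, 0.5), (7.5, 2.0, 0.3, -0.05, 0.9)]
pixels = [(float(x), float(y)) for y in range(4) for x in range(4)]

G = [gaussian_row(*g) for g in gaussians]
P = [pixel_column(*p) for p in pixels]
powers = matmul(G, P)

max_err = max(
    abs(powers[i][j] - exponent_direct(*gaussians[i], *pixels[j]))
    for i in range(len(gaussians))
    for j in range(len(pixels))
)
print("max deviation from direct evaluation:", max_err)
```

The check runs in double precision, where the two forms agree up to ordinary rounding. It says nothing about agreement at the reduced precision of Tensor Core inputs.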
If this is right
- Vanilla 3DGS rendering can run faster on existing GPUs without altering the input data or model.
- Other acceleration techniques for 3DGS become more effective when paired with this hardware mapping.
- Real-time frame rates become more achievable for complex scenes in applications like AR and simulation.
- The same reformulation principle may apply to similar volume rendering pipelines.
Where Pith is reading between the lines
- This could inspire similar hardware-specific reformulations in other graphics algorithms that currently avoid matrix units.
- Developers might integrate this into standard 3DGS libraries to make Tensor Core usage automatic.
- Energy efficiency could improve on devices with dedicated matrix accelerators.
Load-bearing premise
The rewritten blending formula must produce exactly the same numerical results as the original without introducing rounding errors or instability on Tensor Cores.
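How input rounding might propagate through blending can be probed with a small simulation. This is a crude stand-in under stated assumptions, not a model of Tensor Core arithmetic: opacities and colors for one pixel are drawn from a seeded random generator (hypothetical data), rounded to FP16 with Python's `struct` half-precision format, and composited front to back in double precision.

```python
import random
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def composite(alphas, colors):
    """Front-to-back alpha compositing for one pixel and one color channel."""
    transmittance = 1.0
    pixel = 0.0
    for alpha, color in zip(alphas, colors):
        pixel += color * alpha * transmittance
        transmittance *= 1.0 - alpha
    return pixel


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 300                 # hypothetical number of Gaussians covering the pixel
alphas = [rng.uniform(0.01, 0.3) for _ in range(n)]
colors = [rng.random() for _ in range(n)]

reference = composite(alphas, colors)
rounded = composite([to_fp16(a) for a in alphas], [to_fp16(c) for c in colors])
abs_diff = abs(reference - rounded)
print("reference:", reference, "fp16 inputs:", rounded, "abs diff:", abs_diff)
```

A nonzero difference here does not by itself show a visible quality loss. It only shows that bit-exact equality is not something reduced-precision inputs can be expected to deliver.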
What would settle it
Comparing pixel values from the original 3DGS renderer and the GEMM-GS version on identical inputs; any systematic difference beyond machine precision would disprove equivalence.
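Such a comparison could be scripted along the following lines. This is a minimal sketch assuming both renderers output images flattened to lists of floats in [0, 1]; the function name `diff_stats` and the sample values are hypothetical, and loading real renders is left out.

```python
import math


def diff_stats(reference, candidate, peak=1.0):
    """Per-pixel difference statistics between two equally sized images."""
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of values")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    mse = sum(d * d for d in diffs) / len(diffs)
    # PSNR is infinite when the two images are identical.
    psnr = math.inf if mse == 0.0 else 10.0 * math.log10(peak * peak / mse)
    return {
        "max_abs": max(diffs),
        "mean_abs": sum(diffs) / len(diffs),
        "psnr_db": psnr,
    }


# Hypothetical pixel values standing in for the two renderers' outputs.
ref_img = [0.0, 0.25, 0.5, 1.0]
new_img = [0.0, 0.25, 0.5, 0.99]

stats = diff_stats(ref_img, new_img)
print(stats)
```

The maximum absolute difference matters most for an equivalence claim, since averaged metrics such as PSNR can hide a small number of badly wrong pixels.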
Original abstract
Neural Radiance Fields (NeRF) enables 3D scene reconstruction from several 2D images but incurs high rendering latency via its point-sampling design. 3D Gaussian Splatting (3DGS) improves on NeRF with explicit scene representation and an optimized pipeline yet still fails to meet practical real-time demands. Existing acceleration works overlook the evolving Tensor Cores of modern GPUs because the 3DGS pipeline lacks General Matrix Multiplication (GEMM) operations. This paper proposes GEMM-GS, an acceleration approach utilizing Tensor Cores on GPUs via a GEMM-friendly blending transformation. It equivalently reformulates the 3DGS blending process into a GEMM-compatible form to utilize Tensor Cores. A high-performance CUDA kernel is designed, integrating a three-stage double-buffered pipeline that overlaps computation and memory access. Extensive experiments show that GEMM-GS achieves a 1.42× speedup over vanilla 3DGS and provides an additional 1.47× speedup on average when combined with existing acceleration approaches. Code is released at https://github.com/shieldforever/GEMM-GS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GEMM-GS, which accelerates 3D Gaussian Splatting by equivalently reformulating the alpha-blending process into a GEMM-compatible form to enable Tensor Core utilization on GPUs. It describes a high-performance CUDA kernel using a three-stage double-buffered pipeline to overlap computation and memory access. Experiments report a 1.42× speedup over vanilla 3DGS and an average additional 1.47× speedup when combined with prior acceleration techniques, with code released publicly.
Significance. If the reformulation is numerically equivalent and preserves quality, the work is significant because it directly exploits Tensor Cores—an underutilized hardware feature in prior 3DGS accelerators—potentially improving real-time rendering performance. The open release of code is a clear strength that supports reproducibility and extension.
major comments (1)
- [§3.2] §3.2 (blending reformulation): The central claim of exact algebraic equivalence between the original alpha compositing and the GEMM form must be verified under Tensor Core mixed precision (FP16/TF32 inputs with FP32 accumulation). No error bounds, per-pixel difference statistics, or per-scene PSNR/SSIM deltas between the FP32 baseline and the Tensor-Core path are reported, even though accumulation over hundreds of sorted Gaussians can amplify rounding discrepancies.
minor comments (1)
- [Experiments] The experimental section would benefit from an explicit table listing per-scene speedups, hardware platform (GPU model, Tensor Core generation), and exact baseline kernels used for the 1.42× and 1.47× figures.
Simulated Author's Rebuttal
We thank the referee for the detailed review and the constructive comment on numerical verification. We agree that confirming equivalence under Tensor Core mixed precision is essential, especially given potential accumulation effects, and we will incorporate the requested analysis in the revised manuscript.
Point-by-point responses
Referee: [§3.2] §3.2 (blending reformulation): The central claim of exact algebraic equivalence between the original alpha compositing and the GEMM form must be verified under Tensor Core mixed precision (FP16/TF32 inputs with FP32 accumulation). No error bounds, per-pixel difference statistics, or per-scene PSNR/SSIM deltas between the FP32 baseline and the Tensor-Core path are reported, even though accumulation over hundreds of sorted Gaussians can amplify rounding discrepancies.
Authors: We thank the referee for this important observation. The algebraic reformulation in §3.2 is exact in infinite precision, but we acknowledge that Tensor Core mixed-precision (FP16/TF32 inputs with FP32 accumulation) can introduce rounding discrepancies that accumulate over hundreds of Gaussians. In the revised manuscript we will add a dedicated subsection (and supporting appendix) that includes: (1) a brief error-bound analysis derived from the number of terms and FP32 accumulation precision, (2) per-pixel absolute and relative difference statistics (mean/max L1/L2) averaged across the test scenes, and (3) per-scene PSNR/SSIM deltas between the FP32 reference and the Tensor-Core path. These results will be reported for both the standalone GEMM-GS kernel and the combined acceleration settings. We have already collected the necessary data internally and will include the full tables and figures in the revision.
Revision: yes
Circularity Check
No circularity: algebraic reformulation is independent of measured speedups
The central contribution is an algebraic reformulation of alpha blending into GEMM form, presented as an exact equivalence rather than a fitted model or self-referential definition. Speedup claims (1.42× and 1.47×) are obtained by direct measurement against a baseline CUDA implementation, not by predicting from any parameter fitted to the target data. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core transformation. The derivation chain is therefore self-contained: the mathematical step stands on its own algebraic validity, while empirical results are externally falsifiable via the released code.