
arxiv: 2604.02120 · v2 · submitted 2026-04-02 · 💻 cs.AR · cs.GR

GEMM-GS: Accelerating 3D Gaussian Splatting on Tensor Cores with GEMM-Compatible Blending

Pith reviewed 2026-05-13 20:49 UTC · model grok-4.3

classification 💻 cs.AR cs.GR
keywords 3D Gaussian Splatting · Tensor Cores · GEMM acceleration · CUDA kernel optimization · real-time rendering · blending reformulation · GPU acceleration

The pith

Reformulating 3D Gaussian Splatting blending as matrix multiplications enables Tensor Core acceleration, with a 1.42× speedup over vanilla 3DGS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the blending operation in 3D Gaussian Splatting (3DGS) can be rewritten in a form that matches general matrix multiplication (GEMM), allowing direct use of GPU Tensor Cores. This change delivers a 1.42× speedup over the standard implementation. When combined with other acceleration methods, it adds a further 1.47× speedup on average. The reformulation is presented as mathematically equivalent, so image quality is preserved up to floating-point rounding. This matters because 3DGS already renders faster than NeRF but still needs further optimization to reach true real-time performance on consumer hardware.

Core claim

By transforming the alpha blending of Gaussians into GEMM operations, the approach maps the blending stage of the 3DGS pipeline onto Tensor Cores. A three-stage double-buffered CUDA kernel overlaps computation with memory access to maximize throughput. Benchmarks show a 1.42× gain over vanilla 3DGS and an additional 1.47× on average when stacked with prior optimizations.
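To see why overlapping stages pays off, here is a minimal timing model. It is a sketch under stated assumptions, not the paper's kernel: three abstract stages with fixed per-batch costs and two buffers between consecutive stages. The function names, stage costs, and batch counts are illustrative, and this summary does not say what each of the three stages does.

```python
def pipeline_finish_times(n_batches, stage_times):
    """Finish time of every (batch, stage) pair in a double-buffered pipeline.

    Assumptions (illustrative, not taken from the paper): each stage handles
    one batch at a time, and stage s writes batch i into buffer i % 2, which
    is free only once stage s + 1 has finished batch i - 2.
    """
    n_stages = len(stage_times)
    finish = [[0.0] * n_stages for _ in range(n_batches)]
    for i in range(n_batches):
        for s, cost in enumerate(stage_times):
            ready = finish[i][s - 1] if s > 0 else 0.0        # input is produced
            stage_idle = finish[i - 1][s] if i > 0 else 0.0   # stage is free
            buffer_free = (finish[i - 2][s + 1]
                           if i >= 2 and s + 1 < n_stages else 0.0)
            finish[i][s] = max(ready, stage_idle, buffer_free) + cost
    return finish


def serial_time(n_batches, stage_times):
    """Total time when no stage overlaps with any other."""
    return n_batches * sum(stage_times)


def pipelined_time(n_batches, stage_times):
    """Total time of the overlapped schedule (n_batches must be >= 1)."""
    return pipeline_finish_times(n_batches, stage_times)[-1][-1]
```

With three equal-cost stages and four batches, `pipelined_time(4, [1.0, 1.0, 1.0])` gives 6 time units against 12 for `serial_time`; the slowest stage sets the steady-state rate.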

What carries the argument

The GEMM-compatible blending transformation that recasts per-pixel Gaussian accumulation into matrix multiplication operations.
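One plausible reading of "GEMM-compatible", sketched under assumptions: the exponent of each projected 2D Gaussian is a quadratic polynomial in the pixel coordinates, so the exponents for many Gaussian-pixel pairs can come out of a single matrix product. The code below uses the standard 3DGS exponent with conic entries `(a, b, c)` and mean `(mx, my)`. The six-term layout and all function names are hypothetical, and the paper's actual transformation may differ.

```python
def gaussian_row(mx, my, a, b, c):
    """Six coefficients of the exponent as a polynomial in pixel (x, y)."""
    return [
        -0.5 * a,                                              # x^2
        -b,                                                    # x*y
        -0.5 * c,                                              # y^2
        a * mx + b * my,                                       # x
        b * mx + c * my,                                       # y
        -0.5 * a * mx * mx - b * mx * my - 0.5 * c * my * my,  # constant
    ]


def pixel_column(x, y):
    """Monomials of the pixel position that pair with gaussian_row."""
    return [x * x, x * y, y * y, x, y, 1.0]


def exponents_gemm(gaussians, pixels):
    """(num_gaussians x 6) times (6 x num_pixels), written as plain loops."""
    rows = [gaussian_row(*g) for g in gaussians]
    cols = [pixel_column(*p) for p in pixels]
    return [[sum(r * q for r, q in zip(row, col)) for col in cols]
            for row in rows]


def exponents_direct(gaussians, pixels):
    """Reference: evaluate the quadratic form per Gaussian-pixel pair."""
    out = []
    for mx, my, a, b, c in gaussians:
        line = []
        for x, y in pixels:
            dx, dy = x - mx, y - my
            line.append(-0.5 * (a * dx * dx + c * dy * dy) - b * dx * dy)
        out.append(line)
    return out
```

In double precision the two functions agree to rounding error. Expanding the square does introduce cancellation between large terms, which is one reason the reduced-precision question raised in the referee report matters.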

If this is right

  • Vanilla 3DGS rendering can run faster on existing GPUs without altering the input data or model.
  • Other acceleration techniques for 3DGS become more effective when paired with this hardware mapping.
  • Real-time frame rates become more achievable for complex scenes in applications like AR and simulation.
  • The same reformulation principle may apply to similar volume rendering pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could inspire similar hardware-specific reformulations in other graphics algorithms that currently avoid matrix units.
  • Developers might integrate this into standard 3DGS libraries to make Tensor Core usage automatic.
  • Energy efficiency could improve on devices with dedicated matrix accelerators.

Load-bearing premise

The rewritten blending formula must produce exactly the same numerical results as the original without introducing rounding errors or instability on Tensor Cores.

What would settle it

Comparing pixel values from the original 3DGS renderer and the GEMM-GS version on identical inputs; any systematic difference beyond machine precision would disprove equivalence.
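A minimal sketch of such a comparison, assuming both renderers can dump an image as a flat list of floats in `[0, peak]`. The function name and the input format are assumptions made for illustration; the released code may expose images differently.

```python
import math


def compare_images(ref, test, peak=1.0):
    """Return (max absolute difference, PSNR in dB) between two images.

    ref and test are equal-length flat sequences of pixel values.
    PSNR is infinite when the images are bit-identical.
    """
    if len(ref) != len(test) or not ref:
        raise ValueError("images must be non-empty and of equal length")
    max_abs = 0.0
    sq_sum = 0.0
    for r, t in zip(ref, test):
        diff = abs(r - t)
        max_abs = max(max_abs, diff)
        sq_sum += diff * diff
    mse = sq_sum / len(ref)
    psnr = math.inf if mse == 0.0 else 10.0 * math.log10(peak * peak / mse)
    return max_abs, psnr
```

The maximum absolute difference is the stricter of the two numbers: PSNR averages over the image and can hide a handful of badly wrong pixels.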

Figures

Figures reproduced from arXiv: 2604.02120 by Bowen Zhu, Fangxin Liu, Haibing Guan, Haomin Li, Li Jiang, Xinran Liang, Zongwu Wang.

Figure 1: Computing Power Breakdown of modern GPUs
Figure 2: Neural Rendering and Process of 3DGS [10]. (a) Neural rendering (process of novel view synthesis). 3DGS consists of three stages. (b) Stage 1: Preprocessing. Gaussians are projected onto the render image and intersection test is performed to relating projected Gaussians and tiles. Gaussians’ features are also computed, including depth 𝑑 and color c. (c) Stage 2: Duplication. Each Gaussian is duplicated acc…
Figure 3: Rendering Latency Breakdown of 3DGS. The scenes …
Figure 4: High-Performance GPU Kernel Design and Implementation. (a) Dataflow of 3-stage pipeline configured with double …
Figure 5: Average image rendering latency (ms) comparison between GEMM-GS and baseline methods on an H100 GPU. Plot axes: scenes (truck, train, drjohnson, playroom, flower, garden, stump) against latency (ms); series: Vanilla 3DGS and GEMM-GS, each at 1×, 2×, 3×.
Original abstract

Neural Radiance Fields (NeRF) enables 3D scene reconstruction from several 2D images but incurs high rendering latency via its point-sampling design. 3D Gaussian Splatting (3DGS) improves on NeRF with explicit scene representation and an optimized pipeline yet still fails to meet practical real-time demands. Existing acceleration works overlook the evolving Tensor Cores of modern GPUs because 3DGS pipeline lacks General Matrix Multiplication (GEMM) operations. This paper proposes GEMM-GS, an acceleration approach utilizing tensor cores on GPUs via GEMM-friendly blending transformation. It equivalently reformulates the 3DGS blending process into a GEMM-compatible form to utilize Tensor Cores. A high-performance CUDA kernel is designed, integrating a three-stage double-buffered pipeline that overlaps computation and memory access. Extensive experiments show that GEMM-GS achieves $1.42\times$ speedup over vanilla 3DGS and provides an additional $1.47\times$ speedup on average when combining with existing acceleration approaches. Code is released at https://github.com/shieldforever/GEMM-GS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces GEMM-GS, which accelerates 3D Gaussian Splatting by equivalently reformulating the alpha-blending process into a GEMM-compatible form to enable Tensor Core utilization on GPUs. It describes a high-performance CUDA kernel using a three-stage double-buffered pipeline to overlap computation and memory access. Experiments report a 1.42× speedup over vanilla 3DGS and an average additional 1.47× speedup when combined with prior acceleration techniques, with code released publicly.

Significance. If the reformulation is numerically equivalent and preserves quality, the work is significant because it directly exploits Tensor Cores, a hardware feature underutilized in prior 3DGS accelerators, potentially improving real-time rendering performance. The open release of code is a clear strength that supports reproducibility and extension.

major comments (1)
  1. [§3.2] §3.2 (blending reformulation): The central claim of exact algebraic equivalence between the original alpha compositing and the GEMM form must be verified under Tensor Core mixed precision (FP16/TF32 inputs with FP32 accumulation). No error bounds, per-pixel difference statistics, or per-scene PSNR/SSIM deltas between the FP32 baseline and the Tensor-Core path are reported, even though accumulation over hundreds of sorted Gaussians can amplify rounding discrepancies.
minor comments (1)
  1. [Experiments] The experimental section would benefit from an explicit table listing per-scene speedups, hardware platform (GPU model, Tensor Core generation), and exact baseline kernels used for the 1.42× and 1.47× figures.
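The concern in the major comment can be probed with a small experiment. The sketch below rounds blending inputs to half precision and compares the result with a double-precision reference for one color channel. It is an illustration under assumptions: which quantities the actual kernel stores in FP16 or TF32 is not stated in this review, and the function names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def blend(alphas, colors):
    """Front-to-back compositing: sum of c_i * a_i * prod_{j<i} (1 - a_j)."""
    out, transmittance = 0.0, 1.0
    for a, c in zip(alphas, colors):
        out += c * a * transmittance
        transmittance *= 1.0 - a
    return out


def input_rounding_error(alphas, colors):
    """Absolute change in the blended value when inputs are rounded to FP16.

    Accumulation stays in double precision here, so the result isolates
    the effect of input rounding only (an assumption of this sketch).
    """
    rounded = blend([to_fp16(a) for a in alphas],
                    [to_fp16(c) for c in colors])
    return abs(rounded - blend(alphas, colors))
```

Running `input_rounding_error` over the sorted per-pixel Gaussian lists of a real scene, rather than synthetic inputs, would give the per-pixel statistics the referee asks for.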

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the constructive comment on numerical verification. We agree that confirming equivalence under Tensor Core mixed precision is essential, especially given potential accumulation effects, and we will incorporate the requested analysis in the revised manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (blending reformulation): The central claim of exact algebraic equivalence between the original alpha compositing and the GEMM form must be verified under Tensor Core mixed precision (FP16/TF32 inputs with FP32 accumulation). No error bounds, per-pixel difference statistics, or per-scene PSNR/SSIM deltas between the FP32 baseline and the Tensor-Core path are reported, even though accumulation over hundreds of sorted Gaussians can amplify rounding discrepancies.

    Authors: We thank the referee for this important observation. The algebraic reformulation in §3.2 is exact in infinite precision, but we acknowledge that Tensor Core mixed-precision (FP16/TF32 inputs with FP32 accumulation) can introduce rounding discrepancies that accumulate over hundreds of Gaussians. In the revised manuscript we will add a dedicated subsection (and supporting appendix) that includes: (1) a brief error-bound analysis derived from the number of terms and FP32 accumulation precision, (2) per-pixel absolute and relative difference statistics (mean/max L1/L2) averaged across the test scenes, and (3) per-scene PSNR/SSIM deltas between the FP32 reference and the Tensor-Core path. These results will be reported for both the standalone GEMM-GS kernel and the combined acceleration settings. We have already collected the necessary data internally and will include the full tables and figures in the revision.

    Revision: yes
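For orientation, the kind of bound item (1) refers to can be stated with the textbook result for recursive floating-point summation. This is the standard bound, not the authors' analysis, and the term count n = 500 is an illustrative assumption standing in for "hundreds of Gaussians".

```latex
\hat{s} = \mathrm{fl}\Big(\sum_{i=1}^{n} x_i\Big), \qquad
|\hat{s} - s| \le \gamma_{n-1} \sum_{i=1}^{n} |x_i|, \qquad
\gamma_k = \frac{k u}{1 - k u}

% FP32 accumulation: u = 2^{-24}, so for n = 500
\gamma_{499} \approx 499 \cdot 2^{-24} \approx 3.0 \times 10^{-5}

% FP16 or TF32 inputs: unit roundoff per input value
u_{\mathrm{in}} = 2^{-11} \approx 4.9 \times 10^{-4}
```

On these numbers, rounding the inputs is the larger source of error than FP32 accumulation, which suggests the promised statistics should separate the two effects.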

Circularity Check

0 steps flagged

No circularity: algebraic reformulation is independent of measured speedups

Full rationale

The central contribution is an algebraic reformulation of alpha blending into GEMM form, presented as an exact equivalence rather than a fitted model or self-referential definition. Speedup claims (1.42× and 1.47×) are obtained by direct measurement against a baseline CUDA implementation, not by predicting from any parameter fitted to the target data. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core transformation. The derivation chain is therefore self-contained: the mathematical step stands on its own algebraic validity, while empirical results are externally falsifiable via the released code.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard GPU hardware capabilities and the algebraic equivalence of the blending rewrite; no new free parameters, axioms beyond ordinary floating-point arithmetic, or invented entities are introduced.



Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages

  1. [1]

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. 2022. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5470–5479

  2. [2]

    Zhenqi Dai, Ting Liu, and Yanning Zhang. 2025. Efficient Decoupled Feature 3D Gaussian Splatting via Hierarchical Compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11156–11166

  3. [3]

    Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, Zhangyang Wang, et al. 2024. Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ fps. Advances in neural information processing systems 37 (2024), 140138–140158

  4. [4]

    Guofeng Feng, Siyan Chen, Rong Fu, Zimu Liao, Yi Wang, Tao Liu, Boni Hu, Linning Xu, Zhilin Pei, Hengjie Li, et al. 2025. Flashgs: Efficient 3d gaussian splatting for large-scale and high-resolution rendering. In Proceedings of the Computer Vision and Pattern Recognition Conference. 26652–26662

  5. [5]

    Yu Feng, Weikai Lin, Yuge Cheng, Zihan Liu, Jingwen Leng, Minyi Guo, Chen Chen, Shixuan Sun, and Yuhao Zhu. 2025. Lumina: Real-Time Neural Rendering by Exploiting Computational Redundancy. In Proceedings of the 52nd Annual International Symposium on Computer Architecture. 1925–1939

  6. [6]

    Sharath Girish, Kamal Gupta, and Abhinav Shrivastava. 2024. Eagles: Efficient accelerated 3d gaussians with lightweight encodings. In European Conference on Computer Vision. Springer, 54–71

  7. [7]

    Alex Hanson, Allen Tu, Geng Lin, Vasu Singla, Matthias Zwicker, and Tom Goldstein. 2025. Speedy-splat: Fast 3d gaussian splatting with sparse pixels and sparse primitives. In Proceedings of the Computer Vision and Pattern Recognition Conference. 21537–21546

  8. [8]

    Alex Hanson, Allen Tu, Vasu Singla, Mayuka Jayawardhana, Matthias Zwicker, and Tom Goldstein. 2025. PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5949–5958

  9. [9]

    Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. 2018. Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics (ToG) 37, 6 (2018), 1–15

  10. [10]

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42, 4 (2023), 139–1

  12. [12]

    Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. 2017. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG) 36, 4 (2017), 1–13

  13. [13]

    Junseo Lee, Seokwon Lee, Jungi Lee, Junyong Park, and Jaewoong Sim. 2024. Gscore: Efficient radiance field rendering via architectural support for 3d gaussian splatting. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 497–511

  14. [14]

    Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, and Eunbyung Park. 2024. Compact 3d gaussian representation for radiance field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21719–21728

  16. [16]

    Haomin Li, Yue Liang, Fangxin Liu, Bowen Zhu, Zongwu Wang, Yu Feng, Liqiang Lu, Li Jiang, and Haibing Guan. 2026. ORANGE: Exploring Ockham’s Razor for Neural Rendering by Accelerating 3DGS on NPUs with GEMM-Friendly Blending and Balanced Workloads. In 2026 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1–15

  17. [17]

    Zimu Liao, Jifeng Ding, Siwei Cui, Ruixuan Gong, Boni Hu, Yi Wang, Hengjie Li, Hui Wang, Xingcheng Zhang, and Rong Fu. 2025. Tc-gs: A faster gaussian splatting module utilizing tensor cores. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers. 1–9

  18. [18]

    Weikai Lin, Yu Feng, and Yuhao Zhu. 2025. Metasapiens: Real-time neural rendering with efficiency-aware pruning and accelerated foveated rendering. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 669–682

  19. [19]

    Fangxin Liu, Haomin Li, Bowen Zhu, Zongwu Wang, Zhuoran Song, Haibing Guan, and Li Jiang. 2025. Asdr: Exploiting adaptive sampling and data reuse for cim-based instant neural rendering. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 18–33

  20. [20]

    Saswat Subhajyoti Mallick, Rahul Goel, Bernhard Kerbl, Markus Steinberger, Francisco Vicente Carrasco, and Fernando De La Torre. 2024. Taming 3dgs: High-quality radiance fields with limited resources. In SIGGRAPH Asia 2024 Conference Papers. 1–11

  21. [21]

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2021. NeRF: representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (dec 2021), 99–106. doi: 10.1145/3503250

  22. [22]

    K Navaneet, Kossar Pourahmadi Meibodi, Soroush Abbasi Koohpayegani, and Hamed Pirsiavash. 2023. Compact3d: Compressing gaussian splat radiance field models with vector quantization. arXiv preprint arXiv:2311.18159 (2023)

  23. [23]

    Simon Niedermayr, Josef Stumpfegger, and Rüdiger Westermann. 2024. Compressed 3d gaussian splatting for accelerated novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10349–10358

  24. [24]

    Nvidia. 2017. V100 Data Sheet. https://images.nvidia.cn/content/technologies/volta/pdf/volta-v100-datasheet-update-us-1165301-r5.pdf

  25. [25]

    Nvidia. 2020. A100 Data Sheet. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-nvidia-us-2188504-web.pdf

  26. [26]

    Nvidia. 2022. H100 Data Sheet. https://resources.nvidia.com/en-us-hopper-architecture/nvidia-tensor-core-gpu-datasheet

  27. [27]

    Nvidia. 2023. H200 Data Sheet. https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446

  28. [28]

    Nvidia. 2024. B200 Data Sheet. https://nvdam.widen.net/s/wwnsxrhm2w/blackwell-datasheet-3384703

  29. [29]

    Panagiotis Papantonakis, Georgios Kopanas, Bernhard Kerbl, Alexandre Lanvin, and George Drettakis. 2024. Reducing the memory footprint of 3d gaussian splatting. Proceedings of the ACM on Computer Graphics and Interactive Techniques 7, 1 (2024), 1–17

  30. [30]

    Lukas Radl, Michael Steiner, Mathias Parger, Alexander Weinrauch, Bernhard Kerbl, and Markus Steinberger. 2024. Stopthepop: Sorted gaussian splatting for view-consistent real-time rendering. ACM Transactions on Graphics (TOG) 43, 4 (2024), 1–17

  31. [31]

    Jiaming Sun, Xi Chen, Qianqian Wang, Zhengqi Li, Hadar Averbuch-Elor, Xiaowei Zhou, and Noah Snavely. 2022. Neural 3D Reconstruction in the Wild. In ACM SIGGRAPH 2022 Conference Proceedings (SIGGRAPH ’22). Association for Computing Machinery, New York, NY, USA, Article 26, 9 pages. doi: 10.1145/3528233.3530718

  32. [32]

    Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. 2022. Block-nerf: Scalable large scene neural view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8248–8258

  33. [33]

    Chen Wang, Xian Wu, Yuan-Chen Guo, Song-Hai Zhang, Yu-Wing Tai, and Shi-Min Hu. 2022. Nerf-sr: High quality neural radiance fields using supersampling. In Proceedings of the 30th ACM International Conference on Multimedia. 6445–6454

  34. [34]

    Lizhou Wu, Haozhe Zhu, Siqi He, Jiapei Zheng, Chixiao Chen, and Xiaoyang Zeng. 2024. GauSPU: 3D Gaussian Splatting Processor for Real-Time SLAM Systems. In 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 1562–1573

  35. [35]

    Allan Zhou, Moo Jin Kim, Lirui Wang, Pete Florence, and Chelsea Finn. 2023. NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 17907–17917