CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.
ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
baseline 1
citation-polarity summary
years
2026 2verdicts
UNVERDICTED 2representative citing papers
A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.
citing papers explorer
-
Stencil Computations on Cerebras Wafer-Scale Engine
CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.
-
Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels
A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.