CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
Pith reviewed 2026-05-21 07:17 UTC · model grok-4.3
The pith
Transformer operators can be reparameterized to execute as epilogues while GEMM output tiles remain on chip.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CODA fixes the GEMM mainloop and exposes a small set of composable epilogue primitives so that normalization, activations, residual updates, and related computations execute while a GEMM output tile is still resident on chip, before any write to global memory. This reparameterization covers nearly all non-attention computation in the forward and backward pass of a standard Transformer block and preserves the performance structure of expert-written GEMMs.
What carries the argument
The GEMM-plus-epilogue programming model that fixes the mainloop and provides composable primitives for scaling, reductions, pairwise transformations, and accumulation.
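To make the programming model concrete, here is a minimal pure-Python sketch of a tiled GEMM whose accumulator tile is passed through a user-supplied epilogue before the single store to the output. The names `gemm_with_epilogue` and `relu_plus_residual`, the callback signature, and the tile sizes are hypothetical illustrations, not CODA's actual interface, and the sketch makes no claim about how the paper's primitives are named or composed.

```python
from typing import Callable, List

Matrix = List[List[float]]
# Hypothetical epilogue signature: (tile, row_offset, col_offset) -> tile
Epilogue = Callable[[Matrix, int, int], Matrix]


def gemm_with_epilogue(a: Matrix, b: Matrix, epilogue: Epilogue,
                       tile_m: int = 2, tile_n: int = 2) -> Matrix:
    """Compute epilogue(A @ B) one output tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, tile_m):
        for n0 in range(0, n, tile_n):
            rows = range(m0, min(m0 + tile_m, m))
            cols = range(n0, min(n0 + tile_n, n))
            # Fixed "mainloop": accumulate the output tile over K.
            acc = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in cols]
                   for i in rows]
            # The epilogue sees the accumulator before the only store.
            acc = epilogue(acc, m0, n0)
            for di, i in enumerate(rows):
                for dj, j in enumerate(cols):
                    out[i][j] = acc[di][dj]
    return out


def relu_plus_residual(residual: Matrix) -> Epilogue:
    """Elementwise ReLU followed by a residual add, as one epilogue."""
    def epilogue(tile: Matrix, m0: int, n0: int) -> Matrix:
        return [[max(v, 0.0) + residual[m0 + di][n0 + dj]
                 for dj, v in enumerate(row)]
                for di, row in enumerate(tile)]
    return epilogue


A = [[1.0, -2.0], [3.0, 4.0], [0.5, 0.5]]
B = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
R = [[10.0, 10.0, 10.0] for _ in range(3)]
fused = gemm_with_epilogue(A, B, relu_plus_residual(R))
```

The mainloop never changes between kernels; only the callback does, which is the property the review attributes to the model.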
If this is right
- Nearly all non-attention work in forward and backward passes of a standard Transformer block fits inside the epilogue interface.
- Both human-written and LLM-written CODA kernels reach high performance on representative workloads.
- GEMM-plus-epilogue programming provides a practical route to framework productivity while retaining hardware efficiency.
Where Pith is reading between the lines
- The same on-chip epilogue style might be applied to attention blocks if the mainloop can be extended without losing GEMM efficiency.
- Automated code generators could target the constrained epilogue interface to produce kernels for new model variants without manual tuning.
Load-bearing premise
The small set of epilogue primitives is expressive enough to cover nearly all non-attention computation without forcing the GEMM mainloop to be rewritten.
What would settle it
Measure end-to-end training throughput of a full Transformer block implemented entirely with CODA kernels versus the same block using separate framework kernels for normalization, activations, and residuals.
Original abstract
Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation. This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward pass of a standard Transformer block. Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.
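One well-known identity of the kind the abstract calls algebraic reparameterization: for RMSNorm followed by a linear layer, the per-feature gain can be folded into the weight matrix and the per-row reciprocal RMS can be applied after the GEMM, because a per-row scalar commutes with the matrix product. The sketch below checks this numerically. It is an illustration chosen for this review, not a derivation taken from the paper, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def row_rstd(row, eps=1e-6):
    """Reciprocal root-mean-square of one row."""
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)


def norm_then_gemm(x, gain, w):
    """Unfused order: RMSNorm writes a normalized tensor, then GEMM reads it."""
    normed = []
    for row in x:
        r = row_rstd(row)
        normed.append([v * r * g for v, g in zip(row, gain)])
    return matmul(normed, w)


def gemm_then_epilogue_scale(x, gain, w):
    """Reparameterized order: fold gain into W, scale rows after the GEMM."""
    w_folded = [[g * wv for wv in w_row] for g, w_row in zip(gain, w)]
    acc = matmul(x, w_folded)
    # In a fused kernel this per-row scale would run in the epilogue.
    return [[v * row_rstd(row) for v in acc_row]
            for row, acc_row in zip(x, acc)]


X = [[1.0, 2.0, 3.0, 4.0], [-1.0, 0.5, 2.0, -3.0]]
G = [0.5, 1.0, 1.5, 2.0]
W = [[1.0, 2.0], [3.0, 4.0], [-1.0, 0.0], [0.5, -0.5]]

ref = norm_then_gemm(X, G, W)
alt = gemm_then_epilogue_scale(X, G, W)
max_diff = max(abs(p - q) for rp, rq in zip(ref, alt) for p, q in zip(rp, rq))
```

The two orders agree up to floating-point rounding; in reduced precision the difference can be larger, so a real kernel would need its own tolerance check.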
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CODA, a GPU kernel abstraction that rewrites many non-attention Transformer operators (normalization, activations, residuals, reductions) as GEMM-plus-epilogue programs. The core observation is that these operators can be algebraically reparameterized to execute on GEMM output tiles while they remain on-chip, before writing to global memory. The abstraction fixes the GEMM mainloop and exposes a constrained set of composable epilogue primitives (scaling, reductions, pairwise transformations, accumulation) that is claimed to cover nearly all non-attention computation in standard Transformer forward and backward passes while preserving expert GEMM performance. Both human- and LLM-authored CODA kernels are reported to achieve high performance across representative workloads.
Significance. If the algebraic reparameterization and epilogue expressiveness hold, the work offers a practical route to fusing memory-bound operators with high-performance GEMMs, reducing data movement in Transformer training stacks. The parameter-free algebraic approach and the constrained yet composable epilogue interface are strengths that could improve both productivity and efficiency over ad-hoc kernel fusion. The manuscript's emphasis on LLM-authored kernels also highlights a potential path toward automated kernel generation.
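A back-of-envelope model shows why fusing memory-bound operators matters. It assumes, purely for illustration, that each unfused follow-up operator reads and rewrites the full GEMM output once, and it ignores the GEMM inputs, residual inputs, and caching. The function name and the example sizes are hypothetical and are not measurements from the paper.

```python
def output_traffic_bytes(m, n, num_followup_ops, bytes_per_elem=2, fused=False):
    """Global-memory bytes moved for the GEMM output and its follow-up ops.

    Assumed model: the GEMM writes the M x N output once; every unfused
    follow-up op reads it and writes it back (2 passes per op).
    """
    tensor_bytes = m * n * bytes_per_elem
    if fused:
        return tensor_bytes  # single store after the epilogue
    return tensor_bytes * (1 + 2 * num_followup_ops)


# Example: 8192 tokens, hidden size 4096, 16-bit elements, three follow-up ops.
unfused = output_traffic_bytes(8192, 4096, 3)
fused = output_traffic_bytes(8192, 4096, 3, fused=True)
ratio = unfused / fused
```

Under these assumptions the unfused path moves seven times as many output bytes as the fused one; the real ratio depends on which operators are present and on how much of the traffic the cache absorbs.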
major comments (1)
- [§3.2 and §4.2] §3.2 (Epilogue Primitives) and §4.2 (Norm Fusion): The reduction primitive is presented as operating on the GEMM output tile to enable on-chip LayerNorm/RMSNorm. However, standard GEMM tiling produces output tiles whose N-dimension (typically 128) is much smaller than the hidden dimension (e.g., 4096). The text does not specify how partial row reductions for mean/variance are accumulated across tiles using only registers or shared memory without intermediate global writes; if cross-tile communication requires extra memory traffic, the claimed elimination of separate memory-bound kernels for per-token norms does not hold.
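The referee's concern can be reproduced with the numbers given in the comment. With a hidden dimension of 4096 and a tile width of 128, each row is split across 32 column tiles, and a statistic computed inside one tile generally differs from the statistic of the whole row. The sketch below uses a synthetic row (an assumption for illustration only) to show that tile-local mean-of-squares values disagree with each other, so some cross-tile combination step is unavoidable.

```python
def mean_sq(values):
    return sum(v * v for v in values) / len(values)


hidden, tile_n = 4096, 128
num_col_tiles = hidden // tile_n

# Synthetic row with small integer values (illustrative data only).
row = [float(i % 7) - 3.0 for i in range(hidden)]
tiles = [row[j:j + tile_n] for j in range(0, hidden, tile_n)]

per_tile = [mean_sq(t) for t in tiles]   # what each tile can compute alone
full = mean_sq(row)                      # what the norm actually needs
spread = max(per_tile) - min(per_tile)   # disagreement between tiles

# With equal-width tiles, averaging the partials recovers the row statistic.
combined = sum(per_tile) / len(per_tile)
```

Normalizing each tile with its own statistic would therefore give a different result from normalizing the full row; the open question in the comment is where the 32 partials are combined and at what memory cost.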
minor comments (2)
- [Abstract and §5] The abstract and results sections state that CODA kernels achieve 'high performance' but provide no quantitative metrics, baselines, or error bars; adding a table with speedups or roofline comparisons would strengthen the performance claims.
- [§3.2] Notation for the epilogue primitive signatures (e.g., how reduction scope is parameterized) could be clarified with a small example in §3.2 to make the interface more accessible.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback. The observation about cross-tile reduction mechanics for norms is a valid point on presentation clarity, and we address it directly below.
Point-by-point responses
Referee: [§3.2 and §4.2] §3.2 (Epilogue Primitives) and §4.2 (Norm Fusion): The reduction primitive is presented as operating on the GEMM output tile to enable on-chip LayerNorm/RMSNorm. However, standard GEMM tiling produces output tiles whose N-dimension (typically 128) is much smaller than the hidden dimension (e.g., 4096). The text does not specify how partial row reductions for mean/variance are accumulated across tiles using only registers or shared memory without intermediate global writes; if cross-tile communication requires extra memory traffic, the claimed elimination of separate memory-bound kernels for per-token norms does not hold.
Authors: We agree that the manuscript does not provide an explicit description of cross-tile accumulation. In the CODA epilogue design, each output tile performs intra-tile reductions for partial sums and sums-of-squares using registers and shared memory. Across tiles spanning the full hidden dimension of a row, partial statistics are aggregated through a compact per-row auxiliary buffer in global memory via atomic additions executed within the same kernel launch. This keeps the dominant activation tensors on-chip during the GEMM mainloop and epilogue while avoiding separate full-tensor kernel launches. We acknowledge that this incurs limited additional global traffic proportional to tile count rather than tensor size; the net reduction in memory movement relative to unfused baselines remains substantial. We will revise §§3.2 and 4.2 to include a precise description, pseudocode, and a diagram of the multi-tile reduction flow.
Revision: yes
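The scheme described in the response can be simulated on the CPU. In the sketch below a plain Python list stands in for the per-row auxiliary buffer and `+=` stands in for an atomic add. The function names are hypothetical, and the sketch does not model the ordering or synchronization a GPU kernel would need before any consumer reads the finished statistics.

```python
def accumulate_row_stats(h, tile_n):
    """Build per-row [sum, sum_of_squares] from tile-local partials."""
    aux = [[0.0, 0.0] for _ in h]          # stand-in for the global aux buffer
    for n0 in range(0, len(h[0]), tile_n):  # one pass per column tile
        for r, row in enumerate(h):
            chunk = row[n0:n0 + tile_n]
            # Intra-tile reduction (registers / shared memory on a GPU).
            part_sum = sum(chunk)
            part_sq = sum(v * v for v in chunk)
            # Stand-in for an atomic add into the per-row buffer.
            aux[r][0] += part_sum
            aux[r][1] += part_sq
    return aux


def finalize(aux, hidden):
    """Turn accumulated sums into (mean, variance) per row."""
    stats = []
    for total, total_sq in aux:
        mean = total / hidden
        stats.append((mean, total_sq / hidden - mean * mean))
    return stats


def aux_updates_per_row(hidden, tile_n):
    """Two accumulations (sum, sum of squares) per column tile."""
    return 2 * (hidden // tile_n)


H = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
     [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]]
stats = finalize(accumulate_row_stats(H, tile_n=4), hidden=8)
updates = aux_updates_per_row(4096, 128)
```

For the referee's sizes this gives 64 auxiliary updates per row against 4096 output elements per row, consistent with the authors' statement that the extra traffic scales with tile count. Computing variance as E[x^2] minus the squared mean can lose precision when the mean is large relative to the spread, which a real implementation may need to address.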
Circularity Check
No circularity: algebraic reparameterization is a self-contained observation
The paper's derivation rests on an algebraic observation that non-attention Transformer operators can be reparameterized to run as GEMM epilogues while the output tile stays on-chip. This is presented directly as an observation in the abstract and does not reduce to a self-definitional loop, a fitted parameter renamed as a prediction, or any load-bearing self-citation. The claim that the constrained epilogue primitives (scaling, reductions, pairwise transformations, accumulation) cover nearly all such operators is an independent expressiveness assertion rather than a tautology or imported uniqueness result. No equations or steps in the provided text equate the output to the input by construction. The approach is therefore self-contained against external benchmarks of kernel fusion and data-movement reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Algebraic reparameterization of normalization, activation, residual, and reduction operators preserves semantics when fused into a GEMM epilogue
- domain assumption: The performance structure of expert-written GEMM mainloops is preserved when only the epilogue is modified
invented entities (1)
- CODA epilogue primitives: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  CODA keeps the GEMM mainloop fixed and exposes a small set of composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
[41]
, E p i l o g u e V i s i t o r T r e e ] , 467E p i l o g u e V i s i t o r T r e e
-> tuple [ 466C al lab le [... , E p i l o g u e V i s i t o r T r e e ] , 467E p i l o g u e V i s i t o r T r e e . E p i l o g u e A r g u m e n t s , 468dict , 469tuple , 470]: 471" " " Prepare ep il og ue for GEMM with residual , partial mean - of - squares , and 472fused per - N RMSNorm - weight scaling - mirrors t r a i n s t a t i o n ’s ‘ g e m m...
-
[42]
E V T R e s i d u a l : D = acc + C
-
[43]
E V T C o l B l o c k R e d u c t i o n S t o r e : S [m , nb ] = mean ( D [m , nb * bs :( nb +1) * bs ]^2)
-
[44]
tRS_rD is p r e s e r v e d 481so the main D output is also u ns cal ed
E V T R o w V e c M u l P o s t A c t ( local ) : O [m , n ] = D [m , n ] * W [ n ] , side output via TMA 478 479The partial sum - of - squares is c om put ed on the * un sc ale d * D , so a d o w n s t r e a m 480rstd r e d u c t i o n sees the GEMM output before W is applied . tRS_rD is p r e s e r v e d 481so the main D output is also u ns cal ed . 482...
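The three epilogue stages named in the listing docstring (residual add, per-block mean of squares, per-column weight scaling into a side output) can be emulated on the CPU to see how their outputs relate. The sketch below follows the three formulas as written in the docstring; the function name, the argument layout, and the final step that combines the block statistics are assumptions for illustration and are not taken from the CuTe DSL code.

```python
def residual_meansq_scale(acc, c, w, bs):
    """Emulate D = acc + C, S = per-block mean(D^2), O = D * W."""
    # EVTResidual: D = acc + C
    d = [[a + r for a, r in zip(acc_row, c_row)]
         for acc_row, c_row in zip(acc, c)]
    # EVTColBlockReductionStore: S[m, nb] = mean(D[m, nb*bs:(nb+1)*bs]^2),
    # computed on the unscaled D.
    s = [[sum(v * v for v in row[j:j + bs]) / bs
          for j in range(0, len(row), bs)]
         for row in d]
    # EVTRowVecMulPostAct: O[m, n] = D[m, n] * W[n]; D itself is left unscaled.
    o = [[v * wn for v, wn in zip(row, w)] for row in d]
    return d, s, o


ACC = [[1.0, 2.0, 3.0, 4.0]]
C = [[1.0, 0.0, -1.0, 0.0]]
W = [1.0, 0.5, 2.0, 0.25]

D, S, O = residual_meansq_scale(ACC, C, W, bs=2)

# Assumed downstream step: with equal-width blocks, averaging the block
# statistics gives the full-row mean of squares needed for rstd.
row_mean_sq = sum(S[0]) / len(S[0])
```

Because S is computed before the multiplication by W, the downstream reduction sees the statistics of D rather than of the scaled side output, which matches the note in the docstring.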