PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction
Pith reviewed 2026-05-16 12:42 UTC · model grok-4.3
The pith
PipeWeave blends analytical quantification of GPU pipeline demands with machine learning to predict kernel performance across hardware generations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PipeWeave first employs an analytical model to quantify a given kernel's demands on the GPU's heterogeneous instruction pipelines. These analytical features are then fed into a machine learning model to capture complex cross-pipeline interactions and resource dependencies, enabling high-fidelity performance prediction. It achieves 6.1% average error at the kernel level and 8.5% for end-to-end inference, reducing the error of state-of-the-art methods by 6.7x and 4.4x respectively, and it can guide optimizations such as a 1.7x speedup on a production fused MoE Triton kernel.
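As a rough sanity check on these figures, the baseline errors they imply can be recovered by multiplication. This assumes each reduction factor is a plain ratio of average errors, which the abstract does not state explicitly; the function name is illustrative.

```python
def implied_baseline_error(model_error_pct: float, reduction_factor: float) -> float:
    """Baseline error implied by an 'N times lower error' claim.

    Assumption: the factor is baseline_error / model_error on average
    errors. The abstract does not give the exact definition.
    """
    return model_error_pct * reduction_factor


kernel_baseline = implied_baseline_error(6.1, 6.7)
e2e_baseline = implied_baseline_error(8.5, 4.4)
print(f"implied kernel-level baseline error: {kernel_baseline:.1f}%")
print(f"implied end-to-end baseline error: {e2e_baseline:.1f}%")
```

Under that assumption the prior methods would sit at roughly 41% kernel-level and 37% end-to-end error.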
What carries the argument
The PipeWeave hybrid: an analytical front-end that counts kernel demands on heterogeneous instruction pipelines, whose outputs become input features for a machine-learning back-end that models cross-pipeline interactions.
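The two-stage data flow can be sketched as follows. This is a minimal illustration, not the paper's model: the pipeline names, issue rates, instruction counts, and the one-parameter least-squares "learned" stage are all assumptions made for the example, and the real back-end is described as modeling cross-pipeline interactions, which this stand-in does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    """Hypothetical hardware description; not taken from the paper."""
    num_sms: int
    ipc_per_sm: dict  # pipeline name -> instructions issued per cycle per SM


def pipeline_demand_cycles(instr_counts: dict, gpu: GpuSpec) -> dict:
    """Analytical stage: cycles each pipeline would need if it ran alone."""
    return {
        pipe: count / (gpu.ipc_per_sm[pipe] * gpu.num_sms)
        for pipe, count in instr_counts.items()
    }


def fit_scale(bounds: list, measured: list) -> float:
    """Learned-stage stand-in: least-squares scale from bound to measurement."""
    return sum(b * m for b, m in zip(bounds, measured)) / sum(b * b for b in bounds)


# Made-up GPU and kernel, chosen only so the arithmetic is easy to follow.
gpu = GpuSpec(num_sms=100, ipc_per_sm={"fp32": 128, "tensor": 4, "lsu": 32})
kernel = {"fp32": 1_280_000, "tensor": 80_000, "lsu": 480_000}

features = pipeline_demand_cycles(kernel, gpu)
bottleneck_cycles = max(features.values())

# Made-up (bound, measured) pairs standing in for training data.
alpha = fit_scale([100.0, 200.0, 400.0], [120.0, 240.0, 480.0])
predicted_cycles = alpha * bottleneck_cycles
print(features, predicted_cycles)
```

The point of the split is that `features` depends only on the kernel and the hardware description, while everything fitted to measurements lives in the second stage.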
If this is right
- Performance ceilings from the model can be compared against actual kernel runtimes to locate implementation inefficiencies in production code.
- The same framework supports both single-kernel forecasts and full end-to-end inference latency estimates inside serving systems.
- Predictions remain accurate across 11 GPUs from four distinct architecture generations without retraining the analytical component.
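The first implication, comparing a predicted ceiling against measured runtime, might look like the sketch below. The kernel names, timings, and the 0.8 efficiency threshold are invented for illustration; the source does not say how PipeWeave ranks or thresholds kernels.

```python
def flag_inefficient_kernels(timings_us: dict, min_efficiency: float = 0.8) -> dict:
    """Return kernels whose measured runtime is far from the predicted ceiling.

    timings_us maps kernel name -> (predicted_best_case_us, measured_us).
    Efficiency is best_case / measured, so 1.0 means the kernel already
    runs at the predicted ceiling.
    """
    flagged = {}
    for name, (ceiling_us, measured_us) in timings_us.items():
        if ceiling_us <= 0 or measured_us <= 0:
            raise ValueError(f"non-positive timing for {name}")
        efficiency = ceiling_us / measured_us
        if efficiency < min_efficiency:
            flagged[name] = efficiency
    return flagged


# Hypothetical measurements, not data from the paper.
timings = {"kernel_a": (100.0, 150.0), "kernel_b": (10.0, 11.0)}
flagged = flag_inefficient_kernels(timings)
print(flagged)
```

Here only `kernel_a` is flagged, since it runs at about two thirds of its predicted ceiling.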
Where Pith is reading between the lines
- The separation of analytical feature extraction from learned interaction modeling may let the same pipeline features transfer to performance questions on non-GPU accelerators that share similar heterogeneous execution resources.
- Because the analytical stage is interpretable, the framework could be used to rank candidate kernel rewrites before any hardware measurement occurs.
- Extending the analytical front-end to capture memory-hierarchy effects beyond pipeline occupancy would likely further reduce residual error on memory-bound kernels.
Load-bearing premise
The analytical model correctly measures kernel demands on the GPU's separate instruction pipelines in a way that gives the machine-learning stage enough information to learn all relevant interactions even on hardware it has never seen.
What would settle it
Running the model on a GPU from a fifth architecture generation outside the four tested and checking whether average kernel-level prediction error remains under 10 percent.
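Such a check could be scripted as below. The sketch assumes the paper's "average error" is a mean absolute percentage error, which the text here does not confirm, and the latencies are placeholders rather than measurements.

```python
def mean_abs_pct_error(predicted: list, measured: list) -> float:
    """Mean absolute percentage error, in percent."""
    if not measured or len(predicted) != len(measured):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / m for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


# Placeholder kernel latencies (microseconds) on a hypothetical unseen GPU.
predicted_us = [95.0, 210.0, 48.0]
measured_us = [100.0, 200.0, 50.0]

error_pct = mean_abs_pct_error(predicted_us, measured_us)
passes = error_pct < 10.0  # the 10 percent bar proposed above
print(f"average error: {error_pct:.2f}%  passes: {passes}")
```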
Original abstract
The rapid expansion of Transformer-based large language models has dramatically increased the need for high-performance GPUs. As a result, there is growing demand for fast, accurate, and widely generalizable GPU performance models to support next-generation hardware selection and system-level exploration. However, current data-driven methods are limited, exhibiting poor generalization across hardware and inadequate modeling of complex production-level kernels common in modern inference stacks. To address these issues, we present PipeWeave, a unified GPU modeling framework. This approach first employs an analytical model to quantify a given kernel's demands on the GPU's heterogeneous instruction pipelines. These analytical features are then fed into a machine learning (ML) model to capture complex cross-pipeline interactions and resource dependencies, enabling high-fidelity performance prediction. Our evaluation across 11 GPU types from four generations of major architectures on two widely-used serving systems demonstrates that PipeWeave delivers high fidelity and strong generalizability. It achieves accurate predictions, with only 6.1% average error at the kernel level and 8.5% for end-to-end inference, reducing the error of state-of-the-art methods by 6.7x and 4.4x, respectively. We also demonstrate PipeWeave's value "beyond simulation" by utilizing its performance ceiling to diagnose implementation shortcomings and guide the optimization of a production fused MoE Triton kernel, achieving up to 1.7x speedup. Code is available at https://github.com/zksainx/pipeweave.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PipeWeave, a hybrid GPU performance prediction framework. An analytical front-end first quantifies a kernel's demands on heterogeneous instruction pipelines; these features are then supplied to an ML model that captures cross-pipeline interactions and resource dependencies. Evaluation on 11 GPUs spanning four generations and two serving systems reports 6.1% average kernel-level error and 8.5% end-to-end inference error, claimed to be 6.7× and 4.4× lower than state-of-the-art methods. The model is further applied to diagnose and optimize a production fused MoE Triton kernel, yielding up to 1.7× speedup.
Significance. If the accuracy and cross-generation generalizability hold, the work would be a useful contribution to GPU performance modeling for large-scale inference workloads. The hybrid design addresses a recognized weakness of pure data-driven predictors (poor extrapolation to new hardware) while retaining the interpretability of analytical features; the demonstrated use for kernel optimization provides a concrete systems-level payoff.
major comments (3)
- [§3] §3 (Analytical Model): The description of how the analytical stage quantifies demands on heterogeneous pipelines is high-level only; no equations, pseudocode, or parameterization details are supplied for architecture-specific quantities such as per-pipeline throughput, occupancy, or memory-bandwidth scaling. Without these, it is impossible to verify that the extracted features remain sufficient for the downstream ML model on truly unseen GPU generations, which is load-bearing for the 6.7×/4.4× improvement claim.
- [§5] §5 (Evaluation): The reported error figures and improvement factors are given without specifying the exact training/test splits, whether any of the 11 GPUs were held out during ML training, or the precise configurations of the SOTA baselines. This information is required to substantiate the generalizability assertion across four generations.
- [§5.3] §5.3 (Ablation): No ablation isolating the analytical features from the ML component is presented. Consequently, it cannot be determined whether the hybrid synergy is necessary for the observed accuracy or whether a pure ML model with richer inputs would suffice, weakening the central methodological claim.
minor comments (2)
- The GitHub repository is referenced but the manuscript does not indicate whether the analytical-model implementation, feature-extraction scripts, and trained ML weights are included, which would aid reproducibility.
- Figure captions and axis labels in the evaluation section use inconsistent terminology for 'kernel-level' versus 'end-to-end' metrics; standardize for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify key aspects of the presentation. We address each major point below and will revise the manuscript accordingly to improve clarity and substantiation of the claims.
- Referee: [§3] §3 (Analytical Model): The description of how the analytical stage quantifies demands on heterogeneous pipelines is high-level only; no equations, pseudocode, or parameterization details are supplied for architecture-specific quantities such as per-pipeline throughput, occupancy, or memory-bandwidth scaling. Without these, it is impossible to verify that the extracted features remain sufficient for the downstream ML model on truly unseen GPU generations, which is load-bearing for the 6.7×/4.4× improvement claim.
  Authors: We agree that §3 would benefit from greater detail. In the revised manuscript we will expand the analytical model section with explicit equations for per-pipeline throughput (derived from instruction mix and pipeline widths), occupancy estimation, and memory-bandwidth scaling factors. Parameterization will be described using vendor architecture specifications and microbenchmark-derived constants. Pseudocode for the full feature-extraction pipeline will be added as an appendix. These additions will make the portability of the features across generations explicit and allow independent verification of the reported accuracy gains.
  Revision: yes
- Referee: [§5] §5 (Evaluation): The reported error figures and improvement factors are given without specifying the exact training/test splits, whether any of the 11 GPUs were held out during ML training, or the precise configurations of the SOTA baselines. This information is required to substantiate the generalizability assertion across four generations.
  Authors: We acknowledge the omission of explicit protocol details. The revision will add a dedicated paragraph in §5 describing the exact training/test splits (including which GPUs from each generation were held out for cross-generation testing), the cross-validation procedure used, and the precise configurations, hyperparameters, and training regimes of all SOTA baselines. This will allow readers to reproduce the 6.7× and 4.4× error reductions and confirm that the ML component was never trained on the held-out test GPUs.
  Revision: yes
- Referee: [§5.3] §5.3 (Ablation): No ablation isolating the analytical features from the ML component is presented. Consequently, it cannot be determined whether the hybrid synergy is necessary for the observed accuracy or whether a pure ML model with richer inputs would suffice, weakening the central methodological claim.
  Authors: We agree that an explicit ablation is needed to substantiate the hybrid design. The revised §5.3 will include a new ablation study comparing (1) the full PipeWeave model, (2) a pure ML model receiving the same analytical features, and (3) a pure ML model supplied with richer raw hardware counters. Results will quantify the contribution of the analytical front-end to both accuracy and cross-generation generalization, directly addressing whether the hybrid approach is required for the observed performance.
  Revision: yes
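One way to make the held-out protocol from the §5 exchange concrete is a leave-one-generation-out split, sketched below. The GPU and generation labels are placeholders, and nothing here is taken from the paper's actual evaluation setup.

```python
from collections import defaultdict


def leave_one_generation_out(gpu_generation: dict):
    """Yield (held_out_generation, train_gpus, test_gpus) folds.

    Every GPU of the held-out generation goes to the test set, so the
    learned stage never sees that architecture during training.
    """
    by_gen = defaultdict(list)
    for gpu, gen in gpu_generation.items():
        by_gen[gen].append(gpu)
    for held_out in sorted(by_gen):
        test = sorted(by_gen[held_out])
        train = sorted(
            gpu for gen, gpus in by_gen.items() if gen != held_out for gpu in gpus
        )
        yield held_out, train, test


# Placeholder labels; the paper's 11 GPUs are not enumerated in this text.
gpus = {"gpu_1": "gen_a", "gpu_2": "gen_a", "gpu_3": "gen_b",
        "gpu_4": "gen_c", "gpu_5": "gen_d"}
folds = list(leave_one_generation_out(gpus))
for held_out, train, test in folds:
    print(held_out, train, test)
```

Reporting per-fold error from a split like this would show directly whether accuracy survives on an architecture absent from training.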
Circularity Check
Analytical features supplied as independent inputs to ML stage; no reduction of predictions to fitted quantities or self-citation chains
full rationale
The derivation begins with an analytical stage that quantifies kernel demands on heterogeneous instruction pipelines and supplies those quantities as features to a downstream ML model. This structure does not define the target performance metric in terms of the ML outputs, nor does it fit parameters on a subset and relabel the result as a prediction. No equation or claim in the provided text reduces the reported 6.1% kernel-level or 8.5% end-to-end errors to quantities that are tautological with the fitted inputs. The text shows no load-bearing self-citation, imported uniqueness claim, or smuggled ansatz. The approach therefore remains self-contained against external benchmarks and receives only a minor score for the ordinary presence of an analytical front-end.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The analytical model accurately quantifies a kernel's demands on the GPU's heterogeneous instruction pipelines.
Forward citations
Cited by 1 Pith paper
- WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning
  WaveTune introduces a wave-aware bilinear latency predictor and wave-structured sparse sampling to enable fast runtime auto-tuning of GPU kernels, achieving up to 1.83x kernel speedup and 1.33x TTFT reduction with dra...