GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations
Pith reviewed 2026-05-15 20:25 UTC · model grok-4.3
The pith
GPU memory and utilization estimators face persistent limitations in generalization and overhead that hinder reliable collocation of deep learning training tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central finding is that each estimator paradigm offers some benefits, but each also has a limitation that blocks reliable use. Analytical models cannot easily be extended to new GPU architectures or reflect memory optimization savings. CPU-side libraries impose intrusive integration overhead. Both analytical and ML-based methods rely on model specifications or computation graphs, which limits their use across architectures and hardware variants. Utilization estimation is further complicated by non-additive metrics and GPU heterogeneity, and these challenges remain despite the proposed lightweight ML estimator and the dataset release.
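To make the non-additivity point concrete, here is a minimal sketch. It assumes a time-based utilization metric, meaning the fraction of a sampling window in which at least one kernel is running. The busy intervals and the helper names `merge_intervals` and `busy_fraction` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds, with start <= end


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping busy intervals into disjoint ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def busy_fraction(intervals: List[Interval], window: float) -> float:
    """Fraction of the window during which at least one kernel is running."""
    busy = sum(end - start for start, end in merge_intervals(intervals))
    return busy / window


# Hypothetical busy intervals for two tasks over a 10-second window.
task_a = [(0.0, 3.0), (5.0, 8.0)]  # busy for 6 s in total
task_b = [(2.0, 6.0), (7.0, 9.0)]  # busy for 6 s in total

util_a = busy_fraction(task_a, 10.0)
util_b = busy_fraction(task_b, 10.0)
# Overlapping busy time is counted once, so the result is below util_a + util_b.
util_collocated = busy_fraction(task_a + task_b, 10.0)
```

Under these assumptions the collocated value is the length of the union of busy intervals, so summing per-task utilization overestimates it. The sketch also ignores the slowdown that collocated kernels impose on each other, which would shift the intervals themselves.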
What carries the argument
A comparative evaluation framework using synthetic datasets of varied neural network architectures (MLPs, CNNs, Transformers) to test memory prediction accuracy and generalizability of Horus, PyTorch FakeTensor, and custom ML estimators.
Load-bearing premise
The synthetic dataset with controlled variations in MLPs, CNNs, and Transformers plus validation on real-world unseen models sufficiently captures the diversity needed to demonstrate generalizability and limitations of the estimators.
What would settle it
Demonstrating an analytical or ML estimator that maintains low prediction error on new GPU generations and optimization-enabled models without requiring model specifications or computation graphs would falsify the claimed inherent limitations.
Original abstract
Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization estimation -- a key proxy for contention -- enables interference-aware scheduling. Existing GPU memory estimators span three paradigms -- analytical models, CPU-side libraries, and ML-based estimators -- each with distinct limitations: dependence on detailed model specifications, intrusive integration, poor generalization, and varying latency overhead. GPU heterogeneity further complicates estimation, as identical tasks can exhibit different memory footprints across hardware generations. GPU utilization remains comparatively understudied, further complicated by non-additive utilization metrics and GPU heterogeneity. We conduct a systematic analysis of representative estimators from each paradigm -- Horus, PyTorch FakeTensor, and our lightweight ML-based estimator -- evaluating accuracy, generalizability, and overhead. We construct a synthetic dataset spanning MLPs, CNNs, and Transformers with controlled architectural variations, and train MLP- and Transformer-based estimators for memory prediction, and experiment with utilization estimation. Our evaluation reveals key tradeoffs and validates estimators against real-world unseen models. Significant challenges remain: analytical models lack generalization and cannot easily be extended to new GPU architectures or accurately reflect memory optimization savings; CPU-side libraries impose intrusive integration overhead; and both analytical and ML-based estimators rely on model specifications or computation graphs, limiting generalization across diverse architectures and hardware variants. We release all datasets, tools, and artifacts to support further research.
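The analytical paradigm can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not the formula used by Horus. It assumes a plain MLP with bias terms, FP32 tensors, the Adam optimizer (two extra states per parameter), and one stored activation per layer output. It ignores framework overhead, the CUDA context, and allocator fragmentation. The function name `estimate_mlp_training_bytes` and the example layer sizes are hypothetical.

```python
from typing import Sequence


def estimate_mlp_training_bytes(
    layer_sizes: Sequence[int],
    batch_size: int,
    bytes_per_value: int = 4,   # FP32 (assumption)
    optimizer_states: int = 2,  # Adam keeps two moments per parameter (assumption)
) -> int:
    """Rough lower bound on training memory for a fully connected network."""
    # Weights plus biases for each consecutive pair of layers.
    params = sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
    )
    # One copy each for weights and gradients, plus the optimizer states.
    param_values = params * (2 + optimizer_states)
    # The input and every layer output are kept for the backward pass.
    activation_values = batch_size * sum(layer_sizes)
    return (param_values + activation_values) * bytes_per_value


# Hypothetical example: a 784-512-10 classifier trained with batch size 64.
estimate = estimate_mlp_training_bytes([784, 512, 10], batch_size=64)
print(f"Estimated training memory: {estimate / 2**20:.2f} MiB")
```

A formula of this kind hard-codes its assumptions about precision, optimizer, and activation storage. This is one way to see why the abstract says analytical models struggle to reflect memory optimization savings or to carry over to new GPU architectures.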
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes GPU memory and utilization estimation techniques for collocating deep learning training tasks on GPUs. It evaluates three paradigms—analytical models like Horus, CPU-side libraries such as PyTorch FakeTensor, and ML-based estimators—using a synthetic dataset spanning MLPs, CNNs, and Transformers with controlled variations, followed by validation on real-world unseen models. The paper identifies limitations in generalization across architectures and hardware, integration overhead, and reliance on model specifications, while releasing all datasets, tools, and artifacts.
Significance. If the results hold, the work provides valuable insights into the tradeoffs and shortcomings of existing estimators, which are critical for improving resource management, reducing OOM failures, and enabling better collocation in GPU clusters. The emphasis on GPU heterogeneity and the release of artifacts enhance its utility for the community.
major comments (1)
- [Evaluation] Evaluation section: The central claim that analytical, library, and ML-based estimators lack generalization (and cannot easily extend to new architectures or reflect optimizations) rests on evaluation using a synthetic dataset of MLPs/CNNs/Transformers plus validation on real-world unseen models. Without quantitative measures of distributional shift between the synthetic and validation sets (e.g., in layer types, computation graphs, or memory patterns), the accuracy drops and limitations observed may reflect dataset construction rather than fundamental estimator shortcomings. This is load-bearing for the conclusions about poor cross-architecture generalization.
minor comments (1)
- [Abstract] Abstract: The evaluation plan is outlined clearly but lacks specific accuracy metrics, statistical details, or full methodology, which would help readers assess the strength of the reported tradeoffs and validation results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our evaluation methodology. We address the major comment below and will revise the manuscript to strengthen the claims with additional quantitative analysis.
Point-by-point responses
Referee: [Evaluation] Evaluation section: The central claim that analytical, library, and ML-based estimators lack generalization (and cannot easily extend to new architectures or reflect optimizations) rests on evaluation using a synthetic dataset of MLPs/CNNs/Transformers plus validation on real-world unseen models. Without quantitative measures of distributional shift between the synthetic and validation sets (e.g., in layer types, computation graphs, or memory patterns), the accuracy drops and limitations observed may reflect dataset construction rather than fundamental estimator shortcomings. This is load-bearing for the conclusions about poor cross-architecture generalization.
Authors: We agree that quantitative measures of distributional shift would strengthen the evaluation and help demonstrate that the observed limitations reflect fundamental estimator shortcomings rather than artifacts of dataset construction. In the revised manuscript, we will add a dedicated subsection in the Evaluation section that compares the synthetic training set and real-world validation models along dimensions including layer type frequencies, computation graph statistics (depth, branching factor, operator diversity), and memory footprint distributions. We will report statistical distances (e.g., total variation distance on categorical features and Wasserstein distance on continuous memory patterns) together with supporting visualizations. This addition will make the generalization claims more robust while preserving the original experimental design.
Revision: yes
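The two distances named in the response can be sketched as follows. This is a standard-library illustration only. The layer-type lists and memory samples are made-up placeholders, and `total_variation` and `wasserstein_1d` are hypothetical helper names rather than the authors' tooling. Both functions assume non-empty inputs.

```python
from collections import Counter
from typing import Iterable, Sequence


def total_variation(a: Iterable[str], b: Iterable[str]) -> float:
    """Total variation distance between two empirical categorical distributions."""
    ca, cb = Counter(a), Counter(b)
    na, nb = sum(ca.values()), sum(cb.values())
    # Counter returns 0 for categories that are missing on one side.
    return 0.5 * sum(abs(ca[k] / na - cb[k] / nb) for k in set(ca) | set(cb))


def wasserstein_1d(x: Sequence[float], y: Sequence[float]) -> float:
    """Wasserstein-1 distance between two 1-D empirical distributions.

    Integrates |F_x(t) - F_y(t)| over t, where F is the empirical CDF.
    """
    xs, ys = sorted(x), sorted(y)
    points = sorted(set(xs) | set(ys))
    distance, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        # Advance the pointers so that i and j count the samples <= left.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        distance += abs(i / len(xs) - j / len(ys)) * (right - left)
    return distance


# Placeholder data: layer types and peak memory (GB) for two model sets.
synthetic_layers = ["linear", "linear", "conv", "attention"]
real_layers = ["conv", "conv", "attention", "norm"]
tv = total_variation(synthetic_layers, real_layers)

synthetic_mem = [1.0, 2.0, 3.0]
real_mem = [2.0, 3.0, 4.0]
w1 = wasserstein_1d(synthetic_mem, real_mem)
```

A total variation distance near 0 would suggest the two sets use layer types in similar proportions, while a value near 1 means almost no overlap. The Wasserstein value is in the same unit as the samples, so with the placeholder data it reads as an average shift in gigabytes.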
Circularity Check
No significant circularity in derivation or claims
The paper constructs a synthetic dataset spanning MLPs, CNNs, and Transformers, trains its own MLP- and Transformer-based estimators on it, and evaluates accuracy/generalizability against real-world unseen models using external tools (Horus, PyTorch FakeTensor). Conclusions about estimator limitations are empirical outcomes of these comparisons rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or steps reduce by construction to the paper's inputs; the work releases artifacts for independent verification and remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The synthetic dataset spanning MLPs, CNNs, and Transformers with controlled architectural variations adequately represents real-world model diversity for testing generalizability.