arxiv: 2602.17817 · v3 · submitted 2026-02-19 · 💻 cs.DC

GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations

Pith reviewed 2026-05-15 20:25 UTC · model grok-4.3

classification 💻 cs.DC
keywords GPU memory estimation · utilization prediction · deep learning training · resource collocation · GPU heterogeneity · analytical models · ML-based estimators · synthetic dataset

The pith

GPU memory and utilization estimators face persistent limitations in generalization and overhead that hinder reliable collocation of deep learning training tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper systematically evaluates three paradigms of GPU memory estimators—analytical models, CPU-side libraries, and machine learning-based predictors—along with utilization estimation, to assess their suitability for managing collocated training workloads. It constructs a synthetic dataset covering MLPs, CNNs, and Transformers with controlled variations and validates against real unseen models. The evaluation highlights tradeoffs in accuracy, generalization to new architectures, integration overhead, and handling of memory optimizations or hardware differences. These insights matter because poor estimation leads to out-of-memory errors or underutilized GPUs when trying to run multiple tasks together on the same hardware.
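To make the stakes of estimation error concrete, here is a minimal sketch of a collocation admission check driven by memory estimates. The `Task` type, the `fits_on_gpu` helper, and the 10% safety margin are hypothetical choices for illustration; none of them comes from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    est_mem_mib: int  # estimated peak GPU memory from some estimator (hypothetical)


def fits_on_gpu(tasks, gpu_capacity_mib, safety_margin=0.10):
    """Return True if the summed estimates fit within capacity minus a margin.

    The margin is an illustrative guard against underestimation; its value
    is an assumption, not a figure from the paper.
    """
    if not 0.0 <= safety_margin < 1.0:
        raise ValueError("safety_margin must be in [0, 1)")
    usable_mib = gpu_capacity_mib * (1.0 - safety_margin)
    return sum(t.est_mem_mib for t in tasks) <= usable_mib
```

Under these assumptions, two tasks estimated at 9,000 MiB and 4,000 MiB are admitted on a 16,384 MiB GPU. If either estimate is too low the pair may still hit an OOM failure, and if both are too high the scheduler may leave the GPU underutilized.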

Core claim

The central finding is that each estimator paradigm offers some benefits but none is sufficient on its own. Analytical models cannot easily be extended to new GPU architectures or reflect the savings from memory optimizations. CPU-side libraries impose intrusive integration overhead. Both analytical and ML-based methods rely on model specifications or computation graphs, which limits their use across architectures. Utilization estimation is further complicated by non-additive metrics and GPU heterogeneity, and these challenges remain despite the proposed lightweight ML estimator and the dataset release.
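As a rough illustration of the analytical paradigm, the sketch below counts bytes for an MLP trained in fp32 with an Adam-style optimizer. It is textbook accounting written for this page, not the formula used by Horus or any other estimator in the paper. The function names and default values are assumptions.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for fully connected layers, e.g. [784, 256, 10]."""
    return sum(i * o + o for i, o in zip(layer_sizes, layer_sizes[1:]))


def analytical_mlp_memory_bytes(layer_sizes, batch_size,
                                bytes_per_elem=4, optimizer_states=2):
    """Estimate training memory from the model specification alone.

    Assumes fp32 tensors and an optimizer that keeps `optimizer_states`
    extra copies per parameter (2 for Adam-style moments).
    """
    params = mlp_param_count(layer_sizes)
    # weights + gradients + optimizer states
    static_bytes = params * bytes_per_elem * (2 + optimizer_states)
    # input batch plus one stored output per layer, kept for the backward pass
    activation_bytes = batch_size * sum(layer_sizes) * bytes_per_elem
    return static_bytes + activation_bytes
```

A count like this needs the full model specification and omits anything hardware- or framework-specific, such as the CUDA context, allocator caching, and workspace buffers. Terms of that kind are one plausible reason a specification-only estimate can diverge from measured usage.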

What carries the argument

A comparative evaluation framework using synthetic datasets of varied neural network architectures (MLPs, CNNs, Transformers) to test memory prediction accuracy and generalizability of Horus, PyTorch FakeTensor, and custom ML estimators.

Load-bearing premise

The synthetic dataset with controlled variations in MLPs, CNNs, and Transformers plus validation on real-world unseen models sufficiently captures the diversity needed to demonstrate generalizability and limitations of the estimators.

What would settle it

Demonstrating an analytical or ML estimator that maintains low prediction error on new GPU generations and optimization-enabled models without requiring model specifications or computation graphs would falsify the claimed inherent limitations.

Figures

Figures reproduced from arXiv: 2602.17817 by Bulat Ibragimov, Danyal Yorulmaz, Ehsan Yousefzadeh-Asl-Miandoab, Pınar Tözün, Reza Karimzadeh.

Figure 2: Staircase growth pattern for memory usage, MLPs on ImageNet [16] and with batch_size=32.
Figure 3: Principal Component Analysis (PCA) of the dataset across different neural network architectures. The figure shows how discretizing the continuous GPU memory usage facilitates formulating the problem as a classification task. In each fold, 70% of the split is used for training and 30% for validation, while the test set is held out separately using an additional 30% split. We evaluate both estimator families…
Figure 4: GPUMemNet Estimator Model Architectures. 3.4 GPU Utilization Estimator. Given that GPU utilization metrics remain largely understudied compared to GPU memory estimation and the Horus dataset and models are currently unavailable, we train a deep learning-based estimator to predict three utilization metrics (SMACT, SMOCC, and DRAMA) using the dataset we curated for GPU memory estimation. These metrics, av…
Figure 5: Actual GPU memory need vs Horus’ estimations for MLP models with varying number of neurons and layers.
Figure 6: GPU memory estimation for real-world unseen CNN and Transformer models using Horus, FakeTensor, and GPUMemNet. FakeTensor fails at Transformer models and GPUMemNet cannot estimate for the unseen model, e.g., DLRM (denoted with X). GPUMemNet provides the closest estimations to actual GPU memory consumption and almost never underestimates. Using utilization as a proxy for resource interference is not straigh…
Figure 7: Principal Component Analysis (PCA) of the datasets across different neural network architectures for utilization metrics.
Original abstract

Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization estimation -- a key proxy for contention -- enables interference-aware scheduling. Existing GPU memory estimators span three paradigms -- analytical models, CPU-side libraries, and ML-based estimators -- each with distinct limitations: dependence on detailed model specifications, intrusive integration, poor generalization, and varying latency overhead. GPU heterogeneity further complicates estimation, as identical tasks can exhibit different memory footprints across hardware generations. GPU utilization remains comparatively understudied, further complicated by non-additive utilization metrics and GPU heterogeneity. We conduct a systematic analysis of representative estimators from each paradigm -- Horus, PyTorch FakeTensor, and our lightweight ML-based estimator -- evaluating accuracy, generalizability, and overhead. We construct a synthetic dataset spanning MLPs, CNNs, and Transformers with controlled architectural variations, and train MLP- and Transformer-based estimators for memory prediction, and experiment with utilization estimation. Our evaluation reveals key tradeoffs and validates estimators against real-world unseen models. Significant challenges remain: analytical models lack generalization and cannot easily be extended to new GPU architectures or accurately reflect memory optimization savings; CPU-side libraries impose intrusive integration overhead; and both analytical and ML-based estimators rely on model specifications or computation graphs, limiting generalization across diverse architectures and hardware variants. We release all datasets, tools, and artifacts to support further research.
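The figure captions above mention a staircase growth pattern and the discretization of continuous memory usage into a classification target. The sketch below shows one way such a discretization could look. The 1,024 MiB bin width and both function names are assumptions, since the paper's actual bin sizes are not given on this page.

```python
def memory_to_class(mem_mib, bin_width_mib=1024):
    """Map a measured peak memory value to a zero-based class index."""
    if mem_mib < 0:
        raise ValueError("mem_mib must be non-negative")
    if bin_width_mib <= 0:
        raise ValueError("bin_width_mib must be positive")
    return int(mem_mib // bin_width_mib)


def class_to_upper_bound(class_idx, bin_width_mib=1024):
    """Upper edge of a bin: an estimate that covers every value in the class."""
    if class_idx < 0:
        raise ValueError("class_idx must be non-negative")
    return (class_idx + 1) * bin_width_mib
```

Reporting the upper edge of the predicted bin trades some over-allocation for a lower chance of underestimating. When the class is predicted correctly, the reported value is always above the measured one.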

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript analyzes GPU memory and utilization estimation techniques for collocating deep learning training tasks on GPUs. It evaluates three paradigms—analytical models like Horus, CPU-side libraries such as PyTorch FakeTensor, and ML-based estimators—using a synthetic dataset spanning MLPs, CNNs, and Transformers with controlled variations, followed by validation on real-world unseen models. The paper identifies limitations in generalization across architectures and hardware, integration overhead, and reliance on model specifications, while releasing all datasets, tools, and artifacts.

Significance. If the results hold, the work provides valuable insights into the tradeoffs and shortcomings of existing estimators, which are critical for improving resource management, reducing OOM failures, and enabling better collocation in GPU clusters. The emphasis on GPU heterogeneity and the release of artifacts enhance its utility for the community.

major comments (1)
  1. [Evaluation] Evaluation section: The central claim that analytical, library, and ML-based estimators lack generalization (and cannot easily extend to new architectures or reflect optimizations) rests on evaluation using a synthetic dataset of MLPs/CNNs/Transformers plus validation on real-world unseen models. Without quantitative measures of distributional shift between the synthetic and validation sets (e.g., in layer types, computation graphs, or memory patterns), the accuracy drops and limitations observed may reflect dataset construction rather than fundamental estimator shortcomings. This is load-bearing for the conclusions about poor cross-architecture generalization.
minor comments (1)
  1. [Abstract] Abstract: The evaluation plan is outlined clearly but lacks specific accuracy metrics, statistical details, or full methodology, which would help readers assess the strength of the reported tradeoffs and validation results.
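The accuracy reporting requested in the minor comment could take many forms. As one hedged possibility, the sketch below computes a mean absolute percentage error, a maximum relative error, and an underestimation rate from paired measurements. The function name and the choice of metrics are assumptions, not the paper's reported methodology.

```python
def estimation_report(actual_mib, predicted_mib):
    """Summarize estimator accuracy over paired (actual, predicted) values."""
    if not actual_mib or len(actual_mib) != len(predicted_mib):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a <= 0 for a in actual_mib):
        raise ValueError("actual values must be positive")
    pairs = list(zip(actual_mib, predicted_mib))
    rel_errors = [abs(p - a) / a for a, p in pairs]
    # Underestimates matter most for collocation because they can cause OOM.
    under = sum(1 for a, p in pairs if p < a)
    return {
        "mape": sum(rel_errors) / len(pairs),
        "max_rel_error": max(rel_errors),
        "underestimation_rate": under / len(pairs),
    }
```

Separating the underestimation rate from the overall error reflects the asymmetry the paper describes: an estimate that is too low risks an out-of-memory failure, while one that is too high only wastes capacity.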

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our evaluation methodology. We address the major comment below and will revise the manuscript to strengthen the claims with additional quantitative analysis.

  1. Referee: [Evaluation] Evaluation section: The central claim that analytical, library, and ML-based estimators lack generalization (and cannot easily extend to new architectures or reflect optimizations) rests on evaluation using a synthetic dataset of MLPs/CNNs/Transformers plus validation on real-world unseen models. Without quantitative measures of distributional shift between the synthetic and validation sets (e.g., in layer types, computation graphs, or memory patterns), the accuracy drops and limitations observed may reflect dataset construction rather than fundamental estimator shortcomings. This is load-bearing for the conclusions about poor cross-architecture generalization.

    Authors: We agree that quantitative measures of distributional shift would strengthen the evaluation and help demonstrate that the observed limitations reflect fundamental estimator shortcomings rather than artifacts of dataset construction. In the revised manuscript, we will add a dedicated subsection in the Evaluation section that compares the synthetic training set and real-world validation models along dimensions including layer type frequencies, computation graph statistics (depth, branching factor, operator diversity), and memory footprint distributions. We will report statistical distances (e.g., total variation distance on categorical features and Wasserstein distance on continuous memory patterns) together with supporting visualizations. This addition will make the generalization claims more robust while preserving the original experimental design. revision: yes
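The two distances named in the response can be sketched in a few lines of standard-library Python. This is a minimal illustration under the assumption that layer types are given as label lists and memory footprints as 1-D samples. It is not the authors' planned analysis, and both function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def total_variation(labels_a, labels_b):
    """Total variation distance between two empirical categorical distributions."""
    if not labels_a or not labels_b:
        raise ValueError("both label lists must be non-empty")
    ca, cb = Counter(labels_a), Counter(labels_b)
    na, nb = len(labels_a), len(labels_b)
    # Counter returns 0 for categories missing from one side.
    return 0.5 * sum(abs(ca[k] / na - cb[k] / nb) for k in set(ca) | set(cb))


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two 1-D samples via their empirical CDFs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    # Integrate |F_x - F_y| over each interval between consecutive sample points.
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total
```

Total variation lies in [0, 1] and suits categorical features such as layer-type frequencies. The Wasserstein distance is expressed in the units of the samples (for example MiB), so a uniform shift of every value by 1 yields a distance of 1.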

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The paper constructs a synthetic dataset spanning MLPs, CNNs, and Transformers, trains its own MLP- and Transformer-based estimators on it, and evaluates accuracy/generalizability against real-world unseen models using external tools (Horus, PyTorch FakeTensor). Conclusions about estimator limitations are empirical outcomes of these comparisons rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or steps reduce by construction to the paper's inputs; the work releases artifacts for independent verification and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the chosen representative estimators and synthetic dataset are sufficient to expose general limitations, with no free parameters or invented entities explicitly introduced in the abstract.

axioms (1)
  • domain assumption: The synthetic dataset spanning MLPs, CNNs, and Transformers with controlled architectural variations adequately represents real-world model diversity for testing generalizability.
    Invoked to support claims about estimator performance on unseen models and hardware variants.

pith-pipeline@v0.9.0 · 5596 in / 1284 out tokens · 23716 ms · 2026-05-15T20:25:12.750308+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

  1. [1]

    Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

  2. [2]

    TensorFlow: A system for large-scale machine learning. In OSDI

  3. [3]

    Hadeel Albahar, Shruti Dongare, Yanlin Du, Nannan Zhao, Arnab K. Paul, and Ali R. Butt. 2022. SchedTune: A Heterogeneity-Aware GPU Scheduler for Deep Learning. In 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). 695–705. doi:10.1109/CCGrid54584.2022.00079

  4. [4]

    PyTorch Contributors. 2025. Fake Tensor Mode in PyTorch. https://pytorch.org/docs/stable/torch.compiler_fake_tensor.html. Accessed: 2025-06-01

  5. [5]

    NVIDIA Corporation. [n. d.]. NVIDIA System Management User Guide. https://docs.nvidia.com/datacenter/nvsm/nvsm-user-guide/latest/nvsm-user-guide.pdf. Accessed: 2025-06-01

  6. [6]

    NVIDIA Corporation. 2024. NVIDIA DCGM. https://developer.nvidia.com/dcgm. Accessed: 2026-02-09

  7. [7]

    Ubuntu Documentation. [n. d.]. Ubuntu Manuals top command. https://manpages.ubuntu.com/manpages/focal/man1/top.1.html. Accessed: 2025-06-01

  8. [8]

    Yanjie Gao, Xianyu Gu, Hongyu Zhang, Haoxiang Lin, and Mao Yang

  9. [9]

    Runtime performance prediction for deep learning models with graph neural network. In 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 368–380

  10. [10]

    Yanjie Gao, Yichen He, Xinze Li, Bo Zhao, Haoxiang Lin, Yoyo Liang, Jing Zhong, Hongyu Zhang, Jingzhou Wang, Yonghua Zeng, et al

  11. [11]

    An Empirical Study on Low GPU Utilization of Deep Learning Jobs. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

  12. [12]

    Yanjie Gao, Yu Liu, Hongyu Zhang, Zhengxian Li, Yonghao Zhu, Haoxiang Lin, and Mao Yang. 2020. Estimating GPU memory consumption of deep learning models. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1342–1352

  13. [13]

    Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. In Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference (Renton, WA, USA) (USENIX ATC ’19). USENIX Association, USA, 947–960

  14. [14]

    Taeho Kim, Yanming Wang, Vatshank Chaturvedi, Lokesh Gupta, Seyeon Kim, Yongin Kwon, and Sangtae Ha. 2024. LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs. arXiv preprint arXiv:2404.10933 (2024)

  15. [15]

    Microsoft DeepSpeed Team. 2023. DeepSpeed Memory Requirements. https://deepspeed.readthedocs.io/en/latest/memory.html. Accessed: February 2026

  16. [16]

    Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phanishayee, and Matei Zaharia. 2020. Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 481–498. https://www.usenix.org/conference/osdi20/presentation/narayanan-deepak

  17. [17]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala

  18. [18]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NIPS. 8026–8037

  19. [19]

    Ties Robroek, Ehsan Yousefzadeh-Asl-Miandoab, and Pınar Tözün

  20. [20]

    An Analysis of Collocation on GPUs for Deep Learning Training. In Proceedings of the 4th Workshop on Machine Learning and Systems. 81–90

  21. [21]

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252. doi:10.1007/s11263-015-0816-y

  22. [22]

    Jiabo Shi, Dimitrios Pezaros, and Yehia Elkhatib. 2025. xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads. In Proceedings of the 26th International Middleware Conference (Vanderbilt University, Nashville, TN, USA) (Middleware ’25). Association for Computing Machinery, New York, NY, USA, 256–269. doi:10.114...

  23. [23]

    Foteini Strati, Xianzhe Ma, and Ana Klimovic. 2024. Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications. In Proceedings of the Nineteenth European Conference on Computer Systems. 1075–1092

  24. [24]

    Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 945–960

  25. [25]

    Ross Wightman. 2019. PyTorch Image Models. https://github.com/rwightman/pytorch-image-models. doi:10.5281/zenodo.4414861

  26. [26]

    Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. 2018. Gandiva: Introspective Cluster Scheduling for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad...

  27. [27]

    Gingfung Yeung, Damian Borowiec, Renyu Yang, Adrian Friday, Richard Harper, and Peter Garraghan. 2022. Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems. IEEE Transactions on Parallel and Distributed Systems 33, 1 (2022), 88–100. doi:10.1109/TPDS.2021.3079202

  28. [28]

    Ehsan Yousefzadeh-Asl-Miandoab, Ties Robroek, and Pinar Tözün

  29. [29]

    Profiling and Monitoring Deep Learning Training Tasks. In Proceedings of the 3rd Workshop on Machine Learning and Systems, EuroMLSys 2023, Rome, Italy, 8 May 2023, Eiko Yoneki and Luigi Nardi (Eds.). ACM, 18–25. doi:10.1145/3578356.3592589