Strait: Perceiving Priority and Interference in ML Inference Serving
Pith reviewed 2026-05-07 07:43 UTC · model grok-4.3
The pith
Strait reduces deadline violations for high-priority ML inference tasks by 1.02 to 11.18 percentage points through interference-aware latency prediction on GPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Strait enhances deadline satisfaction for dual-priority inference traffic under high GPU utilization. To improve latency estimation, it models potential contention during data transfer and accounts for kernel execution interference through an adaptive prediction model. By drawing on these predictions, it performs priority-aware scheduling to deliver differentiated handling. Evaluation results under intense workloads suggest that Strait reduces deadline violations for high-priority tasks by 1.02 to 11.18 percentage points while incurring acceptable costs on low-priority tasks and exhibits more equitable performance than software-defined preemption approaches.
What carries the argument
An adaptive prediction model that estimates data-transfer contention and kernel-execution interference to drive priority-aware scheduling decisions for dual-priority ML inference workloads.
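To make the scheduling side of this claim concrete, here is a minimal sketch of deadline-aware dispatch driven by a latency predictor. This is not the paper's actual algorithm: the class name, the two-level queue, and the shedding policy are illustrative assumptions; only the idea that predicted latency gates which request runs next comes from the review above.

```python
import heapq
import time

class PriorityAwareScheduler:
    """Hypothetical sketch: dual-priority dispatch gated by a latency
    predictor standing in for Strait's contention/interference model."""

    HIGH, LOW = 0, 1

    def __init__(self, predict_latency):
        # predict_latency(request, load) -> estimated seconds (assumed API).
        self.predict_latency = predict_latency
        self.queues = {self.HIGH: [], self.LOW: []}

    def submit(self, request, deadline, priority):
        heapq.heappush(self.queues[priority], (deadline, request))

    def next_request(self, current_load, now=None):
        now = time.monotonic() if now is None else now
        # Serve high priority first, but skip requests the predictor
        # says can no longer meet their deadline.
        for prio in (self.HIGH, self.LOW):
            q = self.queues[prio]
            while q:
                deadline, request = q[0]
                est = self.predict_latency(request, current_load)
                if now + est <= deadline:
                    return heapq.heappop(q)[1]
                heapq.heappop(q)  # predicted to miss: shed it
        return None
```

With a constant predictor, a high-priority request is dispatched before an earlier-queued low-priority one, and a request whose predicted completion exceeds its deadline is shed rather than served late.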
If this is right
- High-priority inference tasks meet deadlines more reliably under intense GPU utilization.
- Low-priority tasks experience only modest increases in latency or violations.
- The system achieves more balanced outcomes across priority classes than preemption-based alternatives.
- Latency estimates become more reliable when multiple DNN models execute concurrently on the same GPUs.
Where Pith is reading between the lines
- The same contention-modeling idea could be generalized to handle more than two priority levels if the prediction accuracy holds.
- Production platforms might use this technique to run mixed-priority workloads on fewer GPUs without dedicated hardware partitions.
- Extending the model to include network or CPU interference could improve scheduling in heterogeneous inference clusters.
Load-bearing premise
The adaptive prediction model for data-transfer contention and kernel-execution interference remains accurate enough under real production dual-priority workloads to support effective priority-aware scheduling decisions.
What would settle it
A production-style dual-priority workload run where disabling the interference prediction model produces no reduction (or an increase) in high-priority deadline violations compared to the full Strait scheduler.
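The settling experiment above amounts to a simple decision rule over two violation-rate measurements. As an illustrative harness (the threshold of one percentage point is an assumption, chosen to match the lower end of the reported range):

```python
def violation_rate(latencies_and_deadlines):
    """Fraction of requests whose observed latency exceeds the deadline."""
    misses = sum(1 for lat, ddl in latencies_and_deadlines if lat > ddl)
    return misses / len(latencies_and_deadlines)

def ablation_supports_claim(full, ablated, min_gain_pp=1.0):
    """The estimator carries the result only if disabling it raises
    high-priority violations by at least min_gain_pp percentage points."""
    delta_pp = (violation_rate(ablated) - violation_rate(full)) * 100
    return delta_pp >= min_gain_pp
```

If the ablated scheduler violates no more often than the full one, the claim that the interference model drives the gains would not be supported.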
Figures
read the original abstract
Machine learning (ML) inference serving systems host deep neural network (DNN) models and schedule incoming inference requests across deployed GPUs. However, limited support for task prioritization and insufficient latency estimation under concurrent execution may restrict their applicability in on-premises scenarios. We present *Strait*, a serving system designed to enhance deadline satisfaction for dual-priority inference traffic under high GPU utilization. To improve latency estimation, Strait models potential contention during data transfer and accounts for kernel execution interference through an adaptive prediction model. By drawing on these predictions, it performs priority-aware scheduling to deliver differentiated handling. Evaluation results under intense workloads suggest that Strait reduces deadline violations for high-priority tasks by 1.02 to 11.18 percentage points while incurring acceptable costs on low-priority tasks. Compared to software-defined preemption approaches, Strait also exhibits more equitable performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Strait, an ML inference serving system for GPUs that employs an adaptive prediction model to estimate data-transfer contention and kernel-execution interference. These estimates feed into priority-aware scheduling for dual-priority workloads. Under intense workloads, Strait is claimed to reduce high-priority deadline violations by 1.02–11.18 percentage points relative to baselines, while imposing acceptable costs on low-priority tasks and achieving more equitable performance than software-defined preemption.
Significance. If the adaptive prediction model proves accurate and the reported gains are causally attributable to it (rather than to basic priority queuing or workload artifacts), Strait would offer a practical contribution to on-premises inference serving by improving deadline satisfaction for prioritized traffic at high GPU utilization. The empirical focus on real dual-priority workloads is a strength, but the absence of isolated model validation and detailed experimental parameters substantially weakens the immediate significance and reproducibility of the results.
major comments (3)
- [Abstract] Abstract: The headline quantitative claim (1.02–11.18 pp reduction in high-priority deadline violations) is presented without workload parameters (model types, request rates, GPU utilization), baseline details, number of trials, or error bars. This prevents evaluation of whether the range reflects consistent gains or sensitivity to specific conditions.
- [Evaluation] Evaluation section: No prediction-error metrics (e.g., latency MAE or accuracy under concurrent dual-priority execution) are reported for the adaptive contention/interference model, and no ablation disables the estimator while retaining the rest of the scheduler. Consequently, the causal link between the model’s predictions and the observed violation reductions cannot be verified; gains could arise from other scheduling mechanisms.
- [System Design] System Design section: The adaptive prediction model’s update triggers, input features, and handling of interference under varying priority mixes are described at too high a level to assess robustness or reproducibility. Without these internals, it is impossible to determine whether the model remains sufficiently accurate under the high-utilization workloads used in the evaluation.
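The prediction-error metric the second comment asks for is straightforward to report. A minimal sketch, assuming only that predictions and observations are logged per request along with a priority tag (the per-class breakdown is our addition, to make accuracy under the high-priority mix visible):

```python
def latency_mae(predicted, observed):
    """Mean absolute error of latency predictions, same units as inputs."""
    assert len(predicted) == len(observed)
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

def mae_by_class(records):
    """records: iterable of (priority, predicted, observed) tuples.
    Returns MAE per priority class."""
    by_class = {}
    for prio, p, o in records:
        by_class.setdefault(prio, []).append((p, o))
    return {prio: latency_mae([p for p, _ in v], [o for _, o in v])
            for prio, v in by_class.items()}
```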
minor comments (3)
- [Abstract] The abstract states 'acceptable costs on low-priority tasks' and 'more equitable performance' without defining the metrics (e.g., latency increase, throughput loss) or providing the corresponding quantitative values.
- [Related Work] Related-work discussion should explicitly compare the adaptive model to prior GPU interference predictors (e.g., those based on kernel profiling or ML-based contention estimation) to clarify novelty.
- [Figures/Tables] Figure and table captions would benefit from explicit statements of the number of experimental runs and the precise definition of 'deadline violation' used in the plots.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. We agree that several clarifications and additions will strengthen the paper's reproducibility and help establish the contribution of the adaptive prediction model. We outline the specific revisions we plan to incorporate in the next version.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline quantitative claim (1.02–11.18 pp reduction in high-priority deadline violations) is presented without workload parameters (model types, request rates, GPU utilization), baseline details, number of trials, or error bars. This prevents evaluation of whether the range reflects consistent gains or sensitivity to specific conditions.
Authors: We acknowledge that the abstract, due to its brevity, does not include all experimental parameters. The reported range reflects results across multiple dual-priority workloads (including ResNet-50, BERT, and VGG models at request rates that drive 80–95% GPU utilization) with FIFO and preemption baselines, averaged over 5 runs per configuration (error bars appear in the corresponding figures). In the revised manuscript we will expand the abstract by one sentence to list representative parameters and explicitly state that full workload details, baselines, and trial counts are provided in Section 5 and Table 1. revision: yes
-
Referee: [Evaluation] Evaluation section: No prediction-error metrics (e.g., latency MAE or accuracy under concurrent dual-priority execution) are reported for the adaptive contention/interference model, and no ablation disables the estimator while retaining the rest of the scheduler. Consequently, the causal link between the model’s predictions and the observed violation reductions cannot be verified; gains could arise from other scheduling mechanisms.
Authors: We agree that isolating the estimator’s contribution is important for establishing causality. The current evaluation focuses on end-to-end system performance, but we will add (1) latency prediction MAE and accuracy figures under concurrent dual-priority execution and (2) an ablation that compares the full Strait scheduler against a priority-aware baseline that disables the adaptive estimator (relying only on static estimates and basic queuing). These additions will be placed in a new subsection of the evaluation and will directly address whether the observed 1.02–11.18 pp reductions are attributable to the contention and interference modeling. revision: yes
-
Referee: [System Design] System Design section: The adaptive prediction model’s update triggers, input features, and handling of interference under varying priority mixes are described at too high a level to assess robustness or reproducibility. Without these internals, it is impossible to determine whether the model remains sufficiently accurate under the high-utilization workloads used in the evaluation.
Authors: We accept that the system-design description is currently high-level. The adaptive model performs online updates when observed latency deviates beyond a configurable threshold (currently 15%), using input features that include instantaneous GPU utilization, per-request data-transfer size, kernel launch parameters, and the current high/low-priority request ratio. Interference is modeled via two separate lightweight regressors (one for PCIe contention, one for SM/kernel interference) that are retrained on recent observations. In the revised version we will add pseudocode for the update loop, an explicit list of input features with their normalization, and a paragraph describing behavior under different priority mixes (e.g., 70/30 vs. 90/10). These details will allow readers to assess robustness at the high-utilization regimes reported in the evaluation. revision: yes
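The update loop the authors describe can be sketched as follows. This is an assumption-laden reconstruction, not the paper's code: the feature layout, the SGD update, and the even split of error between the two regressors are our guesses; only the two-regressor structure, the online-update trigger, and the 15% deviation threshold come from the rebuttal.

```python
class AdaptiveLatencyModel:
    """Illustrative sketch of the described update loop: two lightweight
    linear regressors (PCIe-transfer contention, kernel/SM interference),
    refit online when observed latency drifts past a relative threshold."""

    def __init__(self, n_features=4, threshold=0.15, lr=0.01):
        self.threshold = threshold  # 15% deviation trigger, per the rebuttal
        self.lr = lr
        self.w_transfer = [0.0] * n_features
        self.w_kernel = [0.0] * n_features

    def _dot(self, w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def predict(self, features):
        # features e.g. [gpu_util, transfer_bytes_norm,
        #                kernel_size_norm, high_low_priority_ratio]
        return self._dot(self.w_transfer, features) + self._dot(self.w_kernel, features)

    def observe(self, features, observed_latency):
        """Returns True if the observation triggered a model update."""
        predicted = self.predict(features)
        within = (predicted > 0 and
                  abs(predicted - observed_latency) / observed_latency <= self.threshold)
        if within:
            return False
        err = observed_latency - predicted
        # One SGD step on each regressor; each absorbs half the error.
        for w in (self.w_transfer, self.w_kernel):
            for i, xi in enumerate(features):
                w[i] += self.lr * (err / 2) * xi
        return True
```

Under a stationary workload the model converges until predictions fall inside the deviation band, after which updates stop; a workload shift reopens the band and retraining resumes.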
Circularity Check
No circularity in empirical system design and evaluation
full rationale
The paper presents Strait as an empirical ML inference serving system that incorporates an adaptive prediction model for data-transfer contention and kernel-execution interference to enable priority-aware scheduling. No mathematical derivation chain, equations, or first-principles results are described. The central claims rest on end-to-end experimental results (deadline violation reductions under workloads) rather than any fitted parameter renamed as a prediction or self-referential definition. No self-citations, uniqueness theorems, or ansatzes are invoked in the abstract or description to support load-bearing steps. The contribution is self-contained as a systems engineering and evaluation effort without reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger