Mitigating GIL Bottlenecks in Edge AI Systems

Mridankan Mandal; Smit Sanjay Shende

arXiv: 2601.10582 · v4 · submitted 2026-01-15 · cs.DC · cs.OS · cs.PF

Pith reviewed 2026-05-16 14:04 UTC · model grok-4.3

classification cs.DC · cs.OS · cs.PF

keywords GIL bottlenecks · edge AI · thread pool optimization · Blocking Ratio metric · Python performance · saturation cliff · adaptive runtime

The pith

A Blocking Ratio metric allows adaptive thread management to reach 96.5% of optimal performance in GIL-limited edge AI systems without manual tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Python on edge devices needs many threads to hide I/O latency, but the Global Interpreter Lock serializes them and causes a saturation cliff where performance drops over 20% at high thread counts. The paper introduces a lightweight tool that calculates a Blocking Ratio metric, beta, to tell apart real I/O waits from GIL blocks, then uses this to adjust the runtime adaptively. This library approach hits 96.5% of the best possible speed across seven different edge AI workloads, including ONNX-based ML inference, while using far less memory than multiprocessing and avoiding the CPU stalls of asyncio. Tests with free-threading Python confirm the metric works in both environments.

Core claim

The central contribution is a library-based adaptive runtime driven by the Blocking Ratio metric (beta), which distinguishes genuine I/O wait from GIL contention. It reaches 96.5% of optimal performance without manual tuning and 93.9% average efficiency across the seven edge AI workload profiles.

What carries the argument

The Blocking Ratio metric (beta) that quantifies the fraction of blocked time attributable to I/O versus GIL, powering an adaptive runtime to scale thread pools dynamically.
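
A rough way to see what such a ratio looks like in plain Python is to compare a task's wall-clock time with the CPU time its thread actually consumed. The sketch below is an assumption, not the paper's instrumentor: `measure_beta`, `io_like`, and `cpu_like` are hypothetical names, and this off-CPU proxy lumps GIL waiting together with I/O waiting, so it illustrates only the bookkeeping, not how the paper separates the two.

```python
import time

def measure_beta(task, *args, **kwargs):
    """Run task once and return (result, beta).

    beta here is an illustrative proxy: the fraction of wall-clock time
    during which the calling thread was NOT on the CPU.
    """
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()  # CPU time of the calling thread only
    result = task(*args, **kwargs)
    cpu_used = time.thread_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    if wall_used <= 0:
        return result, 0.0
    beta = 1.0 - cpu_used / wall_used
    # Clamp to [0, 1] to absorb clock granularity effects.
    return result, min(1.0, max(0.0, beta))

def io_like():
    """Stand-in for a blocking I/O call (hypothetical workload)."""
    time.sleep(0.05)
    return "io"

def cpu_like():
    """Stand-in for a CPU-bound phase (hypothetical workload)."""
    return sum(i * i for i in range(200_000))

if __name__ == "__main__":
    print("io_like beta:", measure_beta(io_like)[1])
    print("cpu_like beta:", measure_beta(cpu_like)[1])
```

On an otherwise idle machine the sleeping task should report a beta close to 1 and the arithmetic loop a beta close to 0; under contention the second value rises, which is exactly the ambiguity the paper's metric is meant to resolve.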

If this is right

Reaches 93.9% average efficiency across seven edge AI workload profiles including real ML inference.
Outperforms multiprocessing by avoiding ~8x memory overhead on 512 MB-2 GB RAM devices.
Avoids the blocking during CPU-bound phases seen in asyncio.
Maintains effectiveness in both standard GIL Python and free-threading Python 3.13t, though single-core devices still face context switch costs.
Prevents the saturation cliff degradation of 20% or more at thread counts of 512 or higher.
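
The cliff criterion in the last point (a drop of 20% or more from peak throughput at higher thread counts) can be checked mechanically on any thread-count sweep. The following is a minimal sketch under that reading; `find_saturation_cliff` is a hypothetical helper and the numbers in `example` are made up for illustration, not taken from the paper.

```python
def find_saturation_cliff(throughput_by_threads, drop_threshold=0.20):
    """Return (peak_threads, cliff_threads).

    cliff_threads is the smallest thread count above the peak whose
    throughput is at least drop_threshold below the peak, or None.
    """
    if not throughput_by_threads:
        raise ValueError("need at least one measurement")
    peak_threads = max(throughput_by_threads, key=throughput_by_threads.get)
    peak_tps = throughput_by_threads[peak_threads]
    if peak_tps <= 0:
        raise ValueError("peak throughput must be positive")
    for n in sorted(throughput_by_threads):
        if n <= peak_threads:
            continue
        drop = 1.0 - throughput_by_threads[n] / peak_tps
        if drop >= drop_threshold:
            return peak_threads, n
    return peak_threads, None

# Made-up sweep (threads -> tasks per second), for illustration only.
example = {1: 100.0, 8: 520.0, 32: 900.0, 128: 880.0, 512: 700.0, 2048: 610.0}

if __name__ == "__main__":
    print(find_saturation_cliff(example))
```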

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could apply similar beta-based tuning to other lock-based runtimes beyond Python on constrained hardware.
The approach might reduce the need for manual profiling in production edge deployments where workloads vary.
Further gains could come from combining beta adaptation with hardware-specific thread affinity on multi-core edge devices.
Validation on additional real-world IoT sensors or camera streams would test generalizability beyond the seven profiles studied.

Load-bearing premise

The Blocking Ratio metric reliably distinguishes genuine I/O wait from GIL contention across diverse edge AI workloads without adding significant overhead or requiring workload-specific tuning.

What would settle it

Running the system on a new edge workload where beta misidentifies contention sources, resulting in thread counts that yield less than 80% of optimal performance.
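
This test can be stated as a small check. The sketch assumes that "optimal" means the best throughput found by sweeping fixed pool sizes on the same workload, which is an interpretation rather than something the text above specifies; `efficiency` and `falsifies_claim` are hypothetical names.

```python
def efficiency(adaptive_tps, fixed_tps_by_threads):
    """Adaptive throughput as a fraction of the best fixed-size pool."""
    best = max(fixed_tps_by_threads.values())
    if best <= 0:
        raise ValueError("best fixed-pool throughput must be positive")
    return adaptive_tps / best

def falsifies_claim(adaptive_tps, fixed_tps_by_threads, floor=0.80):
    """True when the adaptive runtime lands below the 80% floor."""
    return efficiency(adaptive_tps, fixed_tps_by_threads) < floor

if __name__ == "__main__":
    sweep = {8: 500.0, 32: 1000.0, 512: 760.0}  # illustrative numbers only
    print(efficiency(965.0, sweep), falsifies_claim(965.0, sweep))
```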

Figures

Figures reproduced from arXiv: 2601.10582 by Mridankan Mandal, Smit Sanjay Shende.

**Figure 3.** Latency Analysis. P99 latency vs. thread count. Shaded bands indicate …

**Figure 4.** Runtime System Architecture. Instrumentor captures timing; Monitor …

**Figure 5.** Controller Flow Diagram. Feedback loop driven by …

**Figure 6.** Convergence Proof Visualization. Blocking characteristic …

**Figure 7.** Baseline Strategy Comparison. Throughput and memory overhead …

**Figure 8.** Workload Sweep Heatmap. Throughput (TPS) across workload types …

Original abstract

Deploying Python-based AI agents on resource-constrained edge devices presents a critical runtime optimization challenge: high thread counts are needed to mask I/O latency, yet Python's Global Interpreter Lock (GIL) serializes execution. We demonstrate that naive thread pool scaling causes a "saturation cliff": a performance degradation of >= 20% at overprovisioned thread counts (N >= 512) on edge representative configurations. We present a lightweight profiling tool and adaptive runtime system that uses a Blocking Ratio metric (beta) to distinguish genuine I/O wait from GIL contention. Our library-based solution achieves 96.5% of optimal performance without manual tuning, outperforming multiprocessing (which is limited by ~8x memory overhead on devices with 512 MB-2 GB RAM) and asyncio (which blocks during CPU bound phases). Evaluation across seven edge AI workload profiles, including real ML inference with ONNX Runtime MobileNetV2, demonstrates 93.9% average efficiency. Comparative experiments with Python 3.13t (free-threading) show that while GIL elimination enables ~4x throughput on multi-core edge devices, the saturation cliff persists on single-core devices due to context switching overhead, validating our beta metric for both GIL and no-GIL environments. This work provides a practical optimization strategy for memory-constrained edge AI systems where traditional solutions fail.
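
The abstract does not spell out the control rule, so the following is only a sketch of how a beta-driven pool-size decision could look. It assumes beta is the fraction of time spent in blocking I/O, so that a high beta suggests more threads can hide latency and a low beta suggests threads are mostly competing for the CPU. The function `next_pool_size`, the thresholds `low` and `high`, the doubling step, and the bounds are all hypothetical.

```python
def next_pool_size(current, beta, *, low=0.3, high=0.7,
                   min_size=1, max_size=512, step=2):
    """Hypothetical rule: grow when mostly blocked on I/O, shrink when mostly on CPU."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if beta >= high:
        proposed = current * step      # threads mostly waiting: add concurrency
    elif beta <= low:
        proposed = current // step     # threads mostly computing: back off
    else:
        proposed = current             # dead band: hold steady
    return max(min_size, min(max_size, proposed))

if __name__ == "__main__":
    size = 4
    # Illustrative beta readings, not measurements from the paper.
    for beta in (0.9, 0.9, 0.8, 0.5, 0.2):
        size = next_pool_size(size, beta)
        print(beta, size)
```

The dead band between the two thresholds is a common way to keep a feedback loop from oscillating; whether the paper's controller uses one is not stated here.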

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a usable beta metric for spotting GIL vs I/O in edge Python workloads and shows solid practical gains, but the evaluation leaves the no-tuning claim only partly checked.

The core contribution is a lightweight Blocking Ratio metric (beta) plus an adaptive runtime that switches threading behavior to dodge the saturation cliff at high thread counts. It reports 96.5% of optimal performance and 93.9% average efficiency on seven workloads, including real MobileNetV2 inference, while staying inside tight memory limits where multiprocessing blows up and asyncio stalls on CPU phases. The comparison to Python 3.13t free-threading is also useful, showing the cliff persists on single-core devices from context-switch costs. That combination of a new diagnostic and a library-based fix is the part that feels fresh for edge AI practitioners.

Referee Report

3 major / 2 minor

Summary. The paper introduces a library-based adaptive runtime for Python edge AI systems that uses a Blocking Ratio metric (beta) to distinguish I/O wait from GIL contention, thereby avoiding the saturation cliff that occurs with naive thread scaling at N >= 512. It claims 96.5% of optimal performance and 93.9% average efficiency across seven workloads (including ONNX MobileNetV2 inference) with no manual tuning, while outperforming multiprocessing (due to memory overhead) and asyncio (due to blocking on CPU phases). The work also validates the approach under Python 3.13t free-threading, noting that the cliff persists on single-core devices due to context switching.

Significance. If the beta metric is shown to be workload- and device-invariant with proper validation, the result would be significant for practical deployment of high-concurrency Python AI agents on memory-limited edge hardware (512 MB-2 GB RAM), where standard multiprocessing and asyncio solutions are shown to be inadequate. The inclusion of real ML inference workloads and free-threading comparisons strengthens the practical relevance.

major comments (3)

[Abstract / Evaluation] Abstract and evaluation: The headline claims of 96.5% of optimal performance and 93.9% average efficiency are presented without error bars, raw results tables, data exclusion criteria, or statistical details on the seven workloads, leaving the central outperformance assertions only partially verifiable and undermining reproducibility.
[Method / Evaluation] Blocking Ratio metric (beta): The adaptation logic relies on beta to set throttling thresholds without manual tuning, yet no derivation, equation, or cross-validation is supplied demonstrating that beta's decision boundary remains invariant under changes in device clock speed, I/O latency distributions, or relative I/O-vs-compute phase durations across the tested profiles (MobileNetV2, sensor polling, etc.).
[Evaluation] Free-threading comparison: The observation that the saturation cliff persists on single-core devices under Python 3.13t is load-bearing for the claim that beta remains useful beyond GIL environments, but the manuscript does not report the computational overhead of beta sampling itself or any ablation showing that the metric does not itself require workload-specific calibration.

minor comments (2)

[Abstract] Abstract: The term 'Blocking Ratio metric (beta)' is introduced without a concise inline definition or pointer to its computation algorithm, which would aid readers.
[Evaluation] Evaluation: Plots or tables should explicitly label all seven workload profiles and include baseline memory footprints for the multiprocessing comparison to make the ~8x overhead claim directly inspectable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the manuscript would benefit from additional statistical details, an explicit derivation of the beta metric, and overhead/ablation data. We will incorporate these changes in the revised version.

Referee: [Abstract / Evaluation] Abstract and evaluation: The headline claims of 96.5% of optimal performance and 93.9% average efficiency are presented without error bars, raw results tables, data exclusion criteria, or statistical details on the seven workloads, leaving the central outperformance assertions only partially verifiable and undermining reproducibility.

Authors: We acknowledge the need for greater transparency. In the revision we will add error bars (standard deviation over 10 runs per workload), a supplementary table listing raw throughput and efficiency numbers for all seven workloads, the number of repetitions, and explicit data exclusion criteria (none were applied beyond discarding the first 5 seconds of each run for warm-up). revision: yes
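
The reporting procedure described in this response (discard the first 5 seconds of each run, then report mean and standard deviation across runs) could be computed along these lines. This is a sketch with assumed inputs: each run is taken to be a list of throughput readings sampled at a fixed interval, and `summarize_runs` is a hypothetical helper.

```python
import statistics

def summarize_runs(runs, sample_interval_s=1.0, warmup_s=5.0):
    """Return (mean, stdev) of per-run steady-state throughput.

    runs: list of runs, each a list of throughput readings taken every
    sample_interval_s seconds. Readings inside the warm-up window are dropped.
    """
    skip = int(warmup_s / sample_interval_s)
    run_means = []
    for run in runs:
        steady = run[skip:]
        if not steady:
            raise ValueError("run is shorter than the warm-up window")
        run_means.append(statistics.fmean(steady))
    mean = statistics.fmean(run_means)
    # Sample standard deviation needs at least two runs.
    std = statistics.stdev(run_means) if len(run_means) > 1 else 0.0
    return mean, std

if __name__ == "__main__":
    # Two made-up runs: 5 warm-up readings followed by 3 steady readings.
    runs = [[0.0] * 5 + [100.0] * 3, [0.0] * 5 + [110.0] * 3]
    print(summarize_runs(runs))
```
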
Referee: [Method / Evaluation] Blocking Ratio metric (beta): The adaptation logic relies on beta to set throttling thresholds without manual tuning, yet no derivation, equation, or cross-validation is supplied demonstrating that beta's decision boundary remains invariant under changes in device clock speed, I/O latency distributions, or relative I/O-vs-compute phase durations across the tested profiles (MobileNetV2, sensor polling, etc.).

Authors: We will insert the formal definition beta = (time in blocking I/O) / (total wall-clock time) together with the sampling procedure and the threshold-selection rule. We will also add a new subsection with cross-validation experiments on two additional edge boards (different clock speeds and I/O latencies) showing that the same beta thresholds remain effective without retuning. We will note that invariance outside the tested device and workload envelope is an assumption that future work should test. revision: yes
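
Written out, the definition quoted in this response would read roughly as follows. The symbols $T_{\text{io}}$ and $T_{\text{wall}}$ are labels chosen here for illustration; the sampling procedure and the threshold-selection rule are not given in the text above and are left out.

```latex
\beta = \frac{T_{\text{io}}}{T_{\text{wall}}}, \qquad 0 \le \beta \le 1
```

Here $T_{\text{io}}$ is the time a thread spends in blocking I/O and $T_{\text{wall}}$ is the total wall-clock time over the same window, so the bounds hold whenever $T_{\text{io}} \le T_{\text{wall}}$.
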
Referee: [Evaluation] Free-threading comparison: The observation that the saturation cliff persists on single-core devices under Python 3.13t is load-bearing for the claim that beta remains useful beyond GIL environments, but the manuscript does not report the computational overhead of beta sampling itself or any ablation showing that the metric does not itself require workload-specific calibration.

Authors: We will add measured overhead figures for beta sampling (< 0.8% CPU on the target hardware) and a new ablation table comparing adaptive beta throttling versus fixed-threshold and no-throttling baselines across all workloads. The ablation confirms that a single set of beta thresholds works for both GIL and free-threading builds without per-workload recalibration. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation of Blocking Ratio or performance claims

The abstract presents beta as an empirically measured profiling metric for distinguishing I/O wait from GIL contention, with the 96.5% optimal performance and 93.9% efficiency claims resting on direct evaluation across seven workloads rather than any self-referential definition, fitted parameter renamed as prediction, or self-citation chain. No equations, ansatzes, or uniqueness theorems are quoted that reduce the adaptive runtime or saturation-cliff mitigation to the input data by construction; the 'no manual tuning' assertion is framed as an outcome of the lightweight tool itself. The derivation is therefore self-contained and externally falsifiable via the reported comparative experiments.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

Central claim rests on the validity of the newly introduced beta metric and standard assumptions about GIL behavior and edge device constraints; limited information available from abstract only.

free parameters (1)

beta adaptation thresholds
Parameters likely used to decide runtime adjustments based on measured beta; not specified in abstract but required for the adaptive system.

axioms (2)

[standard math] Python GIL serializes CPU-bound execution across threads
Standard CPython implementation detail invoked in the problem statement.
[domain assumption] High thread counts are needed to mask I/O latency on edge devices
Core premise of the saturation cliff observation.

invented entities (1)

Blocking Ratio metric (beta) · no independent evidence
Purpose: quantify and distinguish I/O wait from GIL contention for adaptive control.
Newly defined metric central to the solution; no independent evidence outside this work.

