Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs

Ayush Gupta; Balachandra Sunil; David Irwin; Nathan Ng; Prashanthi Kadambi; Prashant Shenoy; Walid A. Hanafy; Yogesh Simmhan

arxiv: 2602.17808 · v2 · submitted 2026-02-19 · 💻 cs.DC · cs.PF

Nathan Ng, Walid A. Hanafy, Prashanthi Kadambi, Balachandra Sunil, Ayush Gupta, David Irwin, Yogesh Simmhan, Prashant Shenoy

Pith reviewed 2026-05-15 20:29 UTC · model grok-4.3

keywords edge TPU · collaborative inference · model partitioning · multi-tenant workloads · queueing model · swapping overhead · on-device AI · latency reduction

The pith

SwapLess uses a queueing model to dynamically partition inference between CPU and Edge TPU, cutting mean latency by up to 77.4% in multi-tenant workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SwapLess, a runtime system that performs collaborative inference by splitting model execution across CPU and memory-limited Edge TPU resources. It relies on an analytic queueing model to select partition points and CPU core counts online, accounting for service times and swapping costs under different workload mixes and arrival rates. The goal is to lower end-to-end response time without hardware modifications, which matters for IoT applications that must run AI models locally under tight memory and connectivity constraints. Implementation results show substantial latency drops relative to the default compiler for both single- and multi-tenant cases.

Core claim

SwapLess is a system for adaptive, multi-tenant TPU-CPU collaborative inference on memory-constrained Edge TPUs. It utilizes an analytic queueing model that captures partition-dependent CPU/TPU service times as well as inter- and intra-model swapping overheads across varying workload mixes and request rates. Using this model, SwapLess continuously adjusts both the partition point and CPU core allocation online to minimize end-to-end response time with low decision overhead. An implementation demonstrates mean latency reductions of up to 63.8% for single-tenant workloads and up to 77.4% for multi-tenant workloads relative to the default Edge TPU compiler.

What carries the argument

The analytic queueing model that predicts partition-dependent service times and swapping overheads to drive online decisions on partition points and CPU core allocations.
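The decision loop described here can be pictured as a search over candidate (partition point, CPU core count) pairs, each scored by a predicted mean response time. The following is a minimal sketch under stated assumptions, not the paper's model: the `Candidate` fields, the `PROFILE` numbers, and the scoring rule are all hypothetical. The score adds the profiled CPU stage time to an M/G/1 estimate for the TPU stage, folds swapping into the TPU service time, treats that service time as deterministic, and ignores queueing on the CPU side.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One hypothetical configuration; every value is illustrative."""
    partition: int    # layer index where execution moves from TPU to CPU
    cpu_cores: int    # cores given to the CPU-side segment
    cpu_ms: float     # profiled CPU service time at that core count
    tpu_ms: float     # profiled TPU compute time for the TPU-side segment
    swap_ms: float    # profiled swapping overhead per request


def predicted_latency_ms(c: Candidate, arrival_per_ms: float) -> Optional[float]:
    """Predicted mean response time, or None if the TPU queue is unstable."""
    service = c.tpu_ms + c.swap_ms      # assumption: swap folded into TPU service
    rho = arrival_per_ms * service      # TPU utilization
    if rho >= 1.0:
        return None
    # Pollaczek-Khinchine mean wait; deterministic service gives E[s^2] = service^2.
    wait = arrival_per_ms * service ** 2 / (2.0 * (1.0 - rho))
    return c.cpu_ms + wait + service


def choose(candidates: Iterable[Candidate], arrival_per_ms: float) -> Optional[Candidate]:
    """Return the stable candidate with the lowest predicted latency."""
    best, best_latency = None, float("inf")
    for c in candidates:
        latency = predicted_latency_ms(c, arrival_per_ms)
        if latency is not None and latency < best_latency:
            best, best_latency = c, latency
    return best


# Hypothetical profile table: more layers on the CPU means less swapping.
PROFILE = [
    Candidate(partition=0, cpu_cores=1, cpu_ms=0.0, tpu_ms=8.0, swap_ms=12.0),
    Candidate(partition=5, cpu_cores=2, cpu_ms=6.0, tpu_ms=6.0, swap_ms=2.0),
    Candidate(partition=9, cpu_cores=4, cpu_ms=20.0, tpu_ms=3.0, swap_ms=0.0),
]
```

With these made-up numbers, `choose(PROFILE, 0.02)` selects partition 5, while at a higher arrival rate of 0.11 requests per millisecond the TPU queue for partition 5 is close to saturation and the search moves to partition 9. That shift with load is the kind of behavior an online, model-driven controller is meant to capture.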

If this is right

Swapping overhead drops enough to allow larger models or higher request rates on the same hardware.
Multi-tenant sharing of the Edge TPU becomes feasible without severe interference between models.
The online adjustments maintain low overhead while responding to changes in workload mix and arrival rate.
No changes to model architecture or compiler are required beyond the runtime partitioning logic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same queueing approach could be tested on other memory-constrained accelerators such as mobile NPUs to check whether similar latency gains appear.
Extending the model to include power draw would allow joint optimization of latency and energy on battery-powered devices.
In deployments with highly bursty traffic, the decision frequency might need tuning to keep overhead negligible.

Load-bearing premise

The analytic queueing model accurately captures the actual CPU and TPU service times plus swapping overheads for real hardware across changing partition choices, workload mixes, and request rates.

What would settle it

Run the system on Edge TPU hardware and compare measured latencies with the queueing model's predictions at multiple partition points and request rates. If the measurements deviate from the predictions by more than a small margin, or if the observed response times do not improve as predicted, the claim that the model enables minimal end-to-end latency would not hold.

Figures

Figures reproduced from arXiv: 2602.17808 by Ayush Gupta, Balachandra Sunil, David Irwin, Nathan Ng, Prashanthi Kadambi, Prashant Shenoy, Walid A. Hanafy, Yogesh Simmhan.

**Figure 1.** Intra-model memory swapping overhead can contribute up to 62% of total TPU inference latency. [Plot: mean latency (ms), split into service time and swapping overhead, per workload mix: MobileNetV2:SqueezeNet (50:50), EfficientNet:GPUNet (50:50), EfficientNet:GPUNet (90:10).]

**Figure 4.** Overview of SwapLess architecture. Accompanying text (Section III, SwapLess Design): "To address the above challenges, we propose SwapLess, a system designed to optimize multi-tenant inference on Edge TPU-equipped devices. This section provides an overview of the SwapLess design, introduces our proposed analytic queuing model that captures the effects of model partitioning, resource allocation, and memory swapping, and outlines its alg…"

**Figure 5.** Model validation for single AI model deployments:

**Figure 6.** Model validation for multi-tenant deployments: (a) Validation of the

**Figure 7.** Latency comparison between SwapLess and baselines under different TPU utilization ρ. Accompanying text: "… achieve similar performance as executing the workload on TPU does not incur swapping overhead. Conversely, for workloads whose memory footprint exceeds TPU capacity, SwapLess demonstrates significant advantages. Under low utilization (ρ = 0.2), SwapLess reduces mean latency by up to 56.2% in single-tenant settings and 68.0…"

**Figure 8.** Performance under dynamic request rates for MnasNet

Original abstract

IoT applications increasingly rely on on-device AI accelerators to ensure high performance, especially in low-connectivity and safety-critical scenarios. However, the limited on-chip memory of these accelerators forces inference runtimes to swap model segments between host and accelerator memory, incurring significant swapping overheads. While collaborative processing by partitioning model execution across CPU and accelerator resources can reduce accelerator memory pressure and execution overhead, naive partitioning may worsen end-to-end latency by either shifting excessive computation to the CPU or failing to sufficiently reduce swapping, a problem that is further exacerbated in multi-tenant and dynamic environments. To address these issues, we present SwapLess, a system for adaptive, multi-tenant TPU-CPU collaborative inference on memory-constrained Edge TPUs. SwapLess utilizes an analytic queueing model that captures partition-dependent CPU/TPU service times as well as inter- and intra-model swapping overheads across different workload mixes and request rates. Using this model, SwapLess continuously adjusts both the partition point and CPU core allocation online to minimize end-to-end response time with low decision overhead. An implementation on Edge TPU-equipped platforms demonstrates that SwapLess reduces mean latency by up to 63.8% for single-tenant workloads and up to 77.4% for multi-tenant workloads relative to the default Edge TPU compiler.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SwapLess uses a queueing model for adaptive TPU-CPU partitioning on memory-limited edge devices and reports solid latency gains, but the model's accuracy under contention needs clearer validation.

SwapLess uses a queueing model to dynamically choose model partitions and CPU allocations for multi-tenant inference on Edge TPUs, claiming a mean latency reduction of up to 77.4% in multi-tenant cases. The new part is the online, low-overhead adaptation based on an analytic model that factors in service times and swap costs for different workloads. This builds on prior partitioning work by making it responsive to runtime conditions without much overhead. The real hardware implementation is a strength, showing concrete gains over the default compiler.

The main soft spot is the model's reliability. The abstract reports big improvements but skips details on validation, error bars, or how workloads were generated. If contention effects like memory bus interference aren't well captured, the decisions could be off and the numbers might not hold in broader tests. The stress-test concern about model error under concurrent tenants seems worth digging into.

This is for people building systems for edge AI in IoT or similar constrained environments. A reader interested in resource management for accelerators would find the approach worth looking at.

I think it should go to peer review. The practical results and the model-driven adaptation make it worth a referee's time, even with the need for more experimental transparency.

Referee Report

1 major / 1 minor

Summary. The paper presents SwapLess, a system for adaptive multi-tenant collaborative inference on memory-constrained Edge TPUs. It employs an analytic queueing model that accounts for partition-dependent CPU/TPU service times and inter/intra-model swapping overheads to dynamically select partition points and CPU core allocations online, with the goal of minimizing end-to-end response time. An implementation on Edge TPU platforms is reported to achieve mean latency reductions of up to 63.8% for single-tenant workloads and 77.4% for multi-tenant workloads relative to the default Edge TPU compiler.

Significance. If the queueing model remains accurate under real multi-tenant hardware contention, the approach could meaningfully improve latency for on-device AI in resource-constrained, multi-tenant IoT settings by reducing unnecessary swapping through principled partitioning. The concrete gains from a real implementation on Edge TPU hardware constitute a practical strength; the analytic (rather than purely empirical) nature of the decision model is also a positive attribute for reproducibility and low-overhead online use.

major comments (1)

[Evaluation] Evaluation section: The central performance claims rest on the queueing model correctly predicting partition-dependent service times and swap costs for arbitrary workload mixes and arrival rates. However, the manuscript provides no quantitative validation of model accuracy (e.g., prediction error, R², or residual plots) against measured hardware behavior under concurrent tenants, nor does it report sensitivity to unmodeled effects such as memory-bus contention or TPU scheduling jitter. Without such evidence, it is unclear whether the reported latency reductions are robust or could degrade when the model drives suboptimal partitions.

minor comments (1)

[Abstract] The abstract and introduction would benefit from a brief statement of the workload generation methodology and the number of runs used to obtain the reported latency figures.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the evaluation of SwapLess. We address the major comment point by point below and outline the revisions we will make to strengthen the manuscript.

Referee: The central performance claims rest on the queueing model correctly predicting partition-dependent service times and swap costs for arbitrary workload mixes and arrival rates. However, the manuscript provides no quantitative validation of model accuracy (e.g., prediction error, R², or residual plots) against measured hardware behavior under concurrent tenants, nor does it report sensitivity to unmodeled effects such as memory-bus contention or TPU scheduling jitter. Without such evidence, it is unclear whether the reported latency reductions are robust or could degrade when the model drives suboptimal partitions.

Authors: We agree that the manuscript would benefit from explicit quantitative validation of the queueing model's predictive accuracy. The reported latency reductions are based on direct hardware measurements using partitions selected by the model, and the consistent improvements across single- and multi-tenant workloads provide supporting evidence that the model captures the primary effects. However, to directly address this concern, we will add a new subsection in the Evaluation section of the revised manuscript. This subsection will include: (1) quantitative comparisons of predicted versus measured CPU/TPU service times and inter/intra-model swap costs for the evaluated workload mixes, reporting metrics such as mean absolute percentage error (MAPE) and R²; (2) residual analysis where feasible; and (3) sensitivity discussion based on additional experiments examining memory-bus contention and TPU scheduling jitter under varying tenant counts. These additions will clarify the model's accuracy and robustness without altering the core claims.

revision: yes
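The two accuracy metrics named in this exchange, MAPE and R², are standard and simple to compute from paired predicted and measured latencies. The sketch below is illustrative only: the function names and the sample numbers are hypothetical, and nothing here reflects measurements from the paper.

```python
from typing import Sequence


def mape(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (measured values must be nonzero)."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((m - p) / m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


def r_squared(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    if ss_tot == 0.0:
        raise ValueError("measured values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Made-up latencies in milliseconds, purely for illustration.
measured_ms = [10.0, 20.0, 40.0]
predicted_ms = [11.0, 18.0, 40.0]
print(f"MAPE = {mape(measured_ms, predicted_ms):.2f}%")
print(f"R^2  = {r_squared(measured_ms, predicted_ms):.4f}")
```

MAPE is easy to read across workloads with different latency scales, while R² shows how much of the variation in measured latency the model explains. Reporting both, per partition point and tenant count, would speak directly to the referee's concern.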

Circularity Check

0 steps flagged

No significant circularity in SwapLess derivation chain

The paper constructs an analytic queueing model from first-principles descriptions of partition-dependent CPU/TPU service times and swapping overheads, then applies the model to drive online partition-point and core-allocation decisions. The reported latency reductions (up to 63.8% single-tenant, 77.4% multi-tenant) are obtained from direct hardware measurements on Edge TPU platforms against the default compiler baseline. Because the model is used only for runtime control and the performance claims rest on external empirical results rather than on any fitted parameter being renamed as a prediction or on self-referential definitions, no load-bearing step reduces to its own inputs by construction. The derivation remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view; the queueing model is presumed to rest on standard queueing assumptions plus measured service times, but no explicit free parameters, axioms, or invented entities are stated.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

SwapLess utilizes an analytic queueing model that captures partition-dependent CPU/TPU service times as well as inter- and intra-model swapping overheads... $E[W_{\mathrm{TPU}}] = \dfrac{\lambda_{\mathrm{TPU}}\, E[s_{\mathrm{TPU}}^{2}]}{2\,(1-\rho_{\mathrm{TPU}})}$
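The quoted expression has the form of the Pollaczek-Khinchine mean waiting time for an M/G/1 queue, in which the second moment of the service time drives queueing delay. As a hedged illustration, assuming Poisson arrivals and independent service times estimated from samples, it can be evaluated as follows; the function name and the sample values are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def mean_wait_mg1(arrival_rate: float, service_samples: Sequence[float]) -> float:
    """E[W] = lambda * E[s^2] / (2 * (1 - rho)), with rho = lambda * E[s]."""
    first_moment = fmean(service_samples)
    second_moment = fmean([s * s for s in service_samples])
    rho = arrival_rate * first_moment
    if not 0.0 <= rho < 1.0:
        raise ValueError(f"utilization must lie in [0, 1): rho={rho:.3f}")
    return arrival_rate * second_moment / (2.0 * (1.0 - rho))


# Same mean service time (10 ms) and arrival rate (0.05 per ms), different variability.
steady = mean_wait_mg1(0.05, [10.0, 10.0])
bursty = mean_wait_mg1(0.05, [5.0, 15.0])
print(f"steady service: {steady:.2f} ms, variable service: {bursty:.2f} ms")
```

Both cases have utilization 0.5, yet the variable case waits longer (6.25 ms against 5 ms) because its second moment is larger. This is one reason swapping matters under this kind of model: occasional swap-inflated requests raise the second moment, and with it the mean wait for every request.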

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

[1] X. Wang et al., “Empowering Edge Intelligence: A Comprehensive Survey on On-Device AI Models,” ACM Comput. Surv., vol. 57, Apr. 2025.

[2] Q. Dong, J. Xu, P. Pillai, and M. Satyanarayanan, “Does Accurate Real-Time AI Need Edge Offload?,” in SEC ’25, 2025.

[3] Q. Liang et al., “Model-driven Cluster Resource Management for AI Workloads in Edge Clouds,” ACM Trans. Auton. Adapt. Syst., vol. 18, Mar. 2023.

[4] N. Ng et al., “Collaborative Inference in Resource-Constrained Edge Networks: Challenges and Opportunities,” in MILCOM’24, 2024.

[5] T. Abdelzaher et al., “Will Distributed Computing Revolutionize Peace? The Emergence of Battlefield IoT,” in ICDCS, pp. 1129–1138, 2018.

[6] Google, “Edge TPU: Run Inference at the Edge.” https://www.coral.ai/docs/edgetpu/inference/, 2019. Accessed: 2026-01-23.

[7] “Raspberry Pi AI HAT+.” https://www.raspberrypi.com/products/ai-hat/. Accessed: 2026-01-23.

[8] B. Zou et al., “A Performance Prediction-based DNN Partitioner for Edge TPU Pipelining,” in MILCOM 2024, pp. 1–6, 2024.

[9] J. Yin et al., “RESPECT: Reinforcement Learning Based Edge Scheduling on Pipelined Coral Edge TPUs,” in DAC’23, pp. 1–6, 2025.

[10] J. Villarrubia et al., “Improving inference time in multi-TPU systems with profiled model segmentation,” in PDP’23, pp. 84–91, 2023.

[11] J. Yin et al., “Exact Memory- and Communication-aware Scheduling of DNNs on Pipelined Edge TPUs,” in SEC’22, pp. 203–215, 2022.

[12] H. Li et al., “Enabling Real-time AI Inference on Mobile Devices via GPU-CPU Collaborative Execution,” in RTCSA’22, 2022.

[13] B. Reidy et al., “Work in Progress: Real-time Transformer Inference on Edge AI Accelerators,” in RTAS’23, pp. 341–344, 2023.

[14] M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, 2013.

[15] A. Shah et al., “Adaptive Alert Management for Balancing Optimal Performance among Distributed CSOCs using Reinforcement Learning,” IEEE Trans. Parallel Distrib. Syst., vol. 31, pp. 16–33, Jan. 2020.

[16] NVIDIA, “ONNX GraphSurgeon.” https://pypi.org/project/onnx-graphsurgeon/, 2025. Accessed: 2026-01-20.

[17] “tf.lite.TFLiteConverter.” https://www.tensorflow.org/api_docs/python/tf/lite/TFLiteConverter, 2024. Accessed: 2026-01-21.

[18] Coral AI, “Edge TPU Compiler.” https://www.coral.ai/docs/edgetpu/compiler. Accessed: 2026-01-19.

[19] S. Yao et al., “FastDeepIoT: Towards Understanding and Optimizing Neural Network Execution Time on Mobile and Embedded Devices,” in SenSys’18, 2018.

[20] “Pipeline a Model with Multiple Edge TPUs.” https://gweb-coral-full.uc.r.appspot.com/docs/edgetpu/pipeline/. Accessed: 2026-01-21.

[21] B. Sun et al., “Sapar: A surrogate-assisted dnn partitioner for efficient inferences on edge tpu pipelines,” ACM Trans. Embed. Comput. Syst., vol. 24, Sept. 2025.

[22] N. D. Lane et al., “DeepX: a software accelerator for low-power deep learning inference on mobile devices,” in IPSN’16, IEEE Press, 2016.

[23] Y. Kim et al., “µlayer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization,” in EuroSys’19, 2019.

[24] F. Jia et al., “CoDL: efficient CPU-GPU co-execution for deep learning inference on mobile devices,” in MobiSys’22, pp. 209–221, 2022.

[25] N. Ling et al., “BlastNet: Exploiting Duo-Blocks for Cross-Processor Real-Time DNN Inference,” in SenSys’22, pp. 91–105, 2023.

[26] J. S. Jeong et al., “Band: coordinated multi-dnn inference on heterogeneous mobile processors,” in MobiSys’22, pp. 235–247, 2022.

[27] T. Sen et al., “Flex: Fast, Accurate DNN Inference on Low-Cost Edges Using Heterogeneous Accelerator Execution,” in EuroSys’25, pp. 507–523, 2025.

[28] P. Guo et al., “Potluck: Cross-Application Approximate Deduplication for Computation-Intensive Mobile Applications,” SIGPLAN Not., vol. 53, pp. 271–284, Mar. 2018.

[29] A. H. Jiang et al., “Mainstream: Dynamic Stem-Sharing for Multi-Tenant video processing,” in USENIX ATC’18, pp. 29–42, July 2018.

[30] B. Fang et al., “NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision,” in MobiCom’18, 2018.

[31] R. Han et al., “LegoDNN: block-grained scaling of deep neural networks for mobile vision,” in MobiCom’21, pp. 406–419, 2021.

[32] Z. Zhang et al., “POS: An Operator Scheduling Framework for Multi-model Inference on Edge Intelligent Computing,” in IPSN’23, p. 1, 2023.
