Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs
Pith reviewed 2026-05-15 20:29 UTC · model grok-4.3
The pith
SwapLess uses a queueing model to dynamically partition inference between CPU and Edge TPU, cutting mean latency by up to 77.4% in multi-tenant workloads.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SwapLess is a system for adaptive, multi-tenant TPU-CPU collaborative inference on memory-constrained Edge TPUs. It utilizes an analytic queueing model that captures partition-dependent CPU/TPU service times as well as inter- and intra-model swapping overheads across varying workload mixes and request rates. Using this model, SwapLess continuously adjusts both the partition point and CPU core allocation online to minimize end-to-end response time with low decision overhead. An implementation demonstrates mean latency reductions of up to 63.8% for single-tenant workloads and up to 77.4% for multi-tenant workloads relative to the default Edge TPU compiler.
What carries the argument
The analytic queueing model that predicts partition-dependent service times and swapping overheads to drive online decisions on partition points and CPU core allocations.
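To make the mechanism concrete, here is a minimal sketch of how a queueing prediction could rank candidate configurations. It assumes the TPU behaves as an M/G/1 queue, whose mean waiting time is given by the standard Pollaczek-Khinchine formula, λ E[s²] / (2(1 − ρ)) with ρ = λ E[s]. The `Candidate` fields, the function names, and the simplification that the CPU side adds only its mean service time with no queueing are all assumptions made for illustration. They are not SwapLess's actual model or API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    """Hypothetical profile of one (partition point, CPU cores) choice."""
    partition: int     # layer index where execution moves from TPU to CPU
    cores: int         # CPU cores given to the CPU-side segment
    tpu_mean: float    # E[s_TPU]: mean TPU service time incl. swapping (s)
    tpu_second: float  # E[s_TPU^2]: second moment of TPU service time (s^2)
    cpu_mean: float    # mean CPU-side service time at this core count (s)


def pk_wait(arrival_rate: float, mean: float, second_moment: float) -> float:
    """Mean M/G/1 queueing delay: lambda * E[s^2] / (2 * (1 - rho))."""
    rho = arrival_rate * mean  # utilization of the server
    if rho >= 1.0:
        return float("inf")    # unstable queue: delay grows without bound
    return arrival_rate * second_moment / (2.0 * (1.0 - rho))


def predicted_response(c: Candidate, arrival_rate: float) -> float:
    """Queueing delay at the TPU plus TPU and CPU service times.

    Assumption: only the TPU is treated as a queue in this sketch.
    """
    wait = pk_wait(arrival_rate, c.tpu_mean, c.tpu_second)
    return wait + c.tpu_mean + c.cpu_mean


def best_candidate(cands: Iterable[Candidate], arrival_rate: float) -> Candidate:
    """Pick the candidate with the lowest predicted mean response time."""
    return min(cands, key=lambda c: predicted_response(c, arrival_rate))
```

Because the prediction is a closed-form expression, evaluating every candidate is cheap, which is consistent with the low decision overhead the paper claims. The per-candidate service-time moments would have to come from profiling.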
If this is right
- Swapping overhead drops enough to allow larger models or higher request rates on the same hardware.
- Multi-tenant sharing of the Edge TPU becomes feasible without severe interference between models.
- The online adjustments maintain low overhead while responding to changes in workload mix and arrival rate.
- No changes to model architecture or compiler are required beyond the runtime partitioning logic.
Where Pith is reading between the lines
- The same queueing approach could be tested on other memory-constrained accelerators such as mobile NPUs to check whether similar latency gains appear.
- Extending the model to include power draw would allow joint optimization of latency and energy on battery-powered devices.
- In deployments with highly bursty traffic, the decision frequency might need tuning to keep overhead negligible.
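The point about decision frequency can be illustrated with a small sketch. It smooths the observed request rate with an exponentially weighted moving average and triggers a new partitioning decision only when the smoothed rate drifts beyond a relative threshold from the rate used for the last decision. The class name `RateTracker`, the smoothing factor, and the threshold are hypothetical choices for illustration. The source does not describe how SwapLess schedules its decisions.

```python
from typing import Optional


class RateTracker:
    """EWMA request-rate estimate with a re-planning trigger (illustrative)."""

    def __init__(self, alpha: float = 0.2, rel_threshold: float = 0.25) -> None:
        self.alpha = alpha                    # weight of the newest window
        self.rel_threshold = rel_threshold    # relative drift that triggers a re-plan
        self.estimate: Optional[float] = None       # smoothed rate (req/s)
        self.planned_rate: Optional[float] = None   # rate used for the last decision

    def observe(self, window_rate: float) -> bool:
        """Feed the rate measured over the last window.

        Returns True when the partition decision should be recomputed.
        """
        if self.estimate is None:
            self.estimate = window_rate
        else:
            self.estimate = (
                self.alpha * window_rate + (1.0 - self.alpha) * self.estimate
            )

        if self.planned_rate is None:
            # First observation: always make an initial decision.
            self.planned_rate = self.estimate
            return True

        drift = abs(self.estimate - self.planned_rate)
        if drift > self.rel_threshold * self.planned_rate:
            self.planned_rate = self.estimate
            return True
        return False
```

A larger `rel_threshold` or a smaller `alpha` makes decisions rarer under bursty traffic, at the cost of reacting more slowly to sustained shifts in load.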
Load-bearing premise
The analytic queueing model accurately captures the actual CPU and TPU service times plus swapping overheads for real hardware across changing partition choices, workload mixes, and request rates.
What would settle it
Run the system on Edge TPU hardware and compare measured latencies against the queueing model's predictions at multiple partition points and request rates. If the measurements deviate from the predictions by more than a small margin, or if the observed response times do not improve as predicted, the claim that the model enables minimal end-to-end latency would not hold.
Original abstract
IoT applications increasingly rely on on-device AI accelerators to ensure high performance, especially in low-connectivity and safety-critical scenarios. However, the limited on-chip memory of these accelerators forces inference runtimes to swap model segments between host and accelerator memory, incurring significant swapping overheads. While collaborative processing by partitioning model execution across CPU and accelerator resources can reduce accelerator memory pressure and execution overhead, naive partitioning may worsen end-to-end latency by either shifting excessive computation to the CPU or failing to sufficiently reduce swapping, a problem that is further exacerbated in multi-tenant and dynamic environments. To address these issues, we present SwapLess, a system for adaptive, multi-tenant TPU-CPU collaborative inference on memory-constrained Edge TPUs. SwapLess utilizes an analytic queueing model that captures partition-dependent CPU/TPU service times as well as inter- and intra-model swapping overheads across different workload mixes and request rates. Using this model, SwapLess continuously adjusts both the partition point and CPU core allocation online to minimize end-to-end response time with low decision overhead. An implementation on Edge TPU-equipped platforms demonstrates that SwapLess reduces mean latency by up to 63.8% for single-tenant workloads and up to 77.4% for multi-tenant workloads relative to the default Edge TPU compiler.
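The trade-off the abstract describes, where a partition can either push too much work to the CPU or leave too much swapping on the accelerator, can be shown with a toy cost model. The sketch below assumes that parameters of the TPU-side layers beyond a fixed on-chip cache are streamed from host memory on every inference at a constant bandwidth. The layer sizes, timings, cache capacity, bandwidth, and function names are all made-up illustrative values and names, not measurements or code from the paper. Only intra-model swapping is modeled, and inter-model swapping between tenants is left out.

```python
from typing import Sequence

# Illustrative per-layer profile (all numbers are hypothetical).
PARAM_BYTES = [2_000_000, 3_000_000, 4_000_000, 5_000_000]
TPU_COMPUTE_S = [0.001, 0.001, 0.001, 0.001]
CPU_COMPUTE_S = [0.004, 0.006, 0.008, 0.010]
CACHE_BYTES = 8_000_000              # assumed on-chip parameter cache
STREAM_BYTES_PER_S = 100_000_000     # assumed host-to-TPU streaming bandwidth


def tpu_segment_time(
    param_bytes: Sequence[int],
    tpu_compute_s: Sequence[float],
    partition: int,
    cache_bytes: int,
    stream_bytes_per_s: float,
) -> float:
    """TPU time for layers [0, partition), including streamed parameters."""
    resident = sum(param_bytes[:partition])
    overflow = max(0, resident - cache_bytes)  # bytes that do not fit on chip
    return sum(tpu_compute_s[:partition]) + overflow / stream_bytes_per_s


def end_to_end_time(partition: int) -> float:
    """Layers before `partition` run on the TPU, the rest on the CPU."""
    tpu = tpu_segment_time(
        PARAM_BYTES, TPU_COMPUTE_S, partition, CACHE_BYTES, STREAM_BYTES_PER_S
    )
    cpu = sum(CPU_COMPUTE_S[partition:])
    return tpu + cpu


def best_partition() -> int:
    """Exhaustively pick the partition point with the lowest estimated time."""
    return min(range(len(PARAM_BYTES) + 1), key=end_to_end_time)
```

With these numbers, running every layer on the TPU pays for streaming the overflow, running every layer on the CPU pays for slow compute, and an intermediate partition point gives the lowest estimate.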
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SwapLess, a system for adaptive multi-tenant collaborative inference on memory-constrained Edge TPUs. It employs an analytic queueing model that accounts for partition-dependent CPU/TPU service times and inter/intra-model swapping overheads to dynamically select partition points and CPU core allocations online, with the goal of minimizing end-to-end response time. An implementation on Edge TPU platforms is reported to achieve mean latency reductions of up to 63.8% for single-tenant workloads and up to 77.4% for multi-tenant workloads relative to the default Edge TPU compiler.
Significance. If the queueing model remains accurate under real multi-tenant hardware contention, the approach could meaningfully improve latency for on-device AI in resource-constrained, multi-tenant IoT settings by reducing unnecessary swapping through principled partitioning. The concrete gains from a real implementation on Edge TPU hardware constitute a practical strength; the analytic (rather than purely empirical) nature of the decision model is also a positive attribute for reproducibility and low-overhead online use.
major comments (1)
- [Evaluation] Evaluation section: The central performance claims rest on the queueing model correctly predicting partition-dependent service times and swap costs for arbitrary workload mixes and arrival rates. However, the manuscript provides no quantitative validation of model accuracy (e.g., prediction error, R², or residual plots) against measured hardware behavior under concurrent tenants, nor does it report sensitivity to unmodeled effects such as memory-bus contention or TPU scheduling jitter. Without such evidence, it is unclear whether the reported latency reductions are robust or could degrade when the model drives suboptimal partitions.
minor comments (1)
- [Abstract] The abstract and introduction would benefit from a brief statement of the workload generation methodology and the number of runs used to obtain the reported latency figures.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the evaluation of SwapLess. We address the major comment point by point below and outline the revisions we will make to strengthen the manuscript.
Referee: The central performance claims rest on the queueing model correctly predicting partition-dependent service times and swap costs for arbitrary workload mixes and arrival rates. However, the manuscript provides no quantitative validation of model accuracy (e.g., prediction error, R², or residual plots) against measured hardware behavior under concurrent tenants, nor does it report sensitivity to unmodeled effects such as memory-bus contention or TPU scheduling jitter. Without such evidence, it is unclear whether the reported latency reductions are robust or could degrade when the model drives suboptimal partitions.
Authors: We agree that the manuscript would benefit from explicit quantitative validation of the queueing model's predictive accuracy. The reported latency reductions are based on direct hardware measurements using partitions selected by the model, and the consistent improvements across single- and multi-tenant workloads provide supporting evidence that the model captures the primary effects. However, to directly address this concern, we will add a new subsection in the Evaluation section of the revised manuscript. This subsection will include: (1) quantitative comparisons of predicted versus measured CPU/TPU service times and inter/intra-model swap costs for the evaluated workload mixes, reporting metrics such as mean absolute percentage error (MAPE) and R²; (2) residual analysis where feasible; and (3) sensitivity discussion based on additional experiments examining memory-bus contention and TPU scheduling jitter under varying tenant counts. These additions will clarify the model's accuracy and robustness without altering the core claims.
Revision: yes
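For readers unfamiliar with the two metrics named in the rebuttal, the following sketch shows how MAPE and R² could be computed from paired predicted and measured latencies. The function names are hypothetical, and any sample values passed in are the caller's own. Nothing here reproduces results from the paper.

```python
from typing import Sequence


def mape(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Assumes every measured value is non-zero.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((m - p) / m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


def r_squared(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes the measured values are not all identical.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot
```

A low MAPE together with an R² close to 1 across partition points and tenant counts would directly support the load-bearing premise. Large residuals concentrated in multi-tenant runs would instead point to unmodeled contention.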
Circularity Check
No significant circularity in SwapLess derivation chain
The paper constructs an analytic queueing model from first-principles descriptions of partition-dependent CPU/TPU service times and swapping overheads, then applies the model to drive online partition-point and core-allocation decisions. The reported latency reductions (up to 63.8% single-tenant, 77.4% multi-tenant) are obtained from direct hardware measurements on Edge TPU platforms against the default compiler baseline. Because the model is used only for runtime control and the performance claims rest on external empirical results rather than on any fitted parameter being renamed as a prediction or on self-referential definitions, no load-bearing step reduces to its own inputs by construction. The derivation remains self-contained.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: SwapLess utilizes an analytic queueing model that captures partition-dependent CPU/TPU service times as well as inter- and intra-model swapping overheads... $E[W_{\mathrm{TPU}}] = \dfrac{\lambda_{\mathrm{TPU}}\, E[s_{\mathrm{TPU}}^{2}]}{2\,(1 - \rho_{\mathrm{TPU}})}$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.