
arxiv: 2602.17808 · v2 · submitted 2026-02-19 · 💻 cs.DC · cs.PF

Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs

Pith reviewed 2026-05-15 20:29 UTC · model grok-4.3

classification 💻 cs.DC cs.PF
keywords edge TPU · collaborative inference · model partitioning · multi-tenant workloads · queueing model · swapping overhead · on-device AI · latency reduction

The pith

SwapLess uses a queueing model to dynamically partition inference between CPU and Edge TPU, cutting mean latency by up to 77.4% in multi-tenant workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SwapLess, a runtime system that performs collaborative inference by splitting model execution across CPU and memory-limited Edge TPU resources. It relies on an analytic queueing model to select partition points and CPU core counts online, accounting for service times and swapping costs under different workload mixes and arrival rates. The goal is to lower end-to-end response time without hardware modifications, which matters for IoT applications that must run AI models locally under tight memory and connectivity constraints. Implementation results show substantial latency drops relative to the default compiler for both single- and multi-tenant cases.

Core claim

SwapLess is a system for adaptive, multi-tenant TPU-CPU collaborative inference on memory-constrained Edge TPUs. It utilizes an analytic queueing model that captures partition-dependent CPU/TPU service times as well as inter- and intra-model swapping overheads across varying workload mixes and request rates. Using this model, SwapLess continuously adjusts both the partition point and CPU core allocation online to minimize end-to-end response time with low decision overhead. An implementation demonstrates mean latency reductions of up to 63.8% for single-tenant workloads and up to 77.4% for multi-tenant workloads relative to the default Edge TPU compiler.
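As a quick arithmetic reading of the headline numbers, the relation below converts a relative latency reduction into a speedup factor. This is a sketch that assumes the percentages are relative reductions in mean latency against the default-compiler baseline; the symbols $L_{\text{base}}$, $L_{\text{new}}$, and $r$ are introduced here for illustration and are not the paper's notation.

```latex
\[
r = \frac{L_{\text{base}} - L_{\text{new}}}{L_{\text{base}}}
\quad\Longrightarrow\quad
\frac{L_{\text{base}}}{L_{\text{new}}} = \frac{1}{1 - r}
\]
\[
r = 0.638 \;\Rightarrow\; \frac{1}{0.362} \approx 2.76\times,
\qquad
r = 0.774 \;\Rightarrow\; \frac{1}{0.226} \approx 4.42\times
\]
```

Both figures are reported as "up to" values, so these factors are best cases rather than averages.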

What carries the argument

The analytic queueing model that predicts partition-dependent service times and swapping overheads to drive online decisions on partition points and CPU core allocations.
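To make the idea concrete, here is a minimal Python sketch of model-driven selection. It is not the paper's model: it assumes the CPU and TPU stages behave as two M/M/1 queues in tandem, that swapping time is simply added to the TPU service time, and that CPU time scales linearly with core count. The function names, the profile layout, and all numbers are hypothetical.

```python
from itertools import product


def mm1_response_ms(service_ms, arrivals_per_ms):
    """Mean response time (ms) of an M/M/1 stage; inf if the stage is overloaded."""
    if service_ms == 0:
        return 0.0  # stage is skipped entirely at this partition point
    rho = arrivals_per_ms * service_ms  # stage utilization
    if rho >= 1.0:
        return float("inf")
    return service_ms / (1.0 - rho)


def best_config(profiles, arrivals_per_ms, max_cores):
    """Return (predicted_ms, partition_index, cores) with the lowest predicted latency.

    profiles[k] = (cpu_ms_on_one_core, tpu_ms, swap_ms) for hypothetical partition k.
    """
    best = None
    for (k, (cpu_ms, tpu_ms, swap_ms)), cores in product(
        enumerate(profiles), range(1, max_cores + 1)
    ):
        cpu_stage = cpu_ms / cores    # ASSUMPTION: ideal linear multi-core speedup
        tpu_stage = tpu_ms + swap_ms  # ASSUMPTION: swapping is charged to the TPU stage
        predicted = (mm1_response_ms(cpu_stage, arrivals_per_ms)
                     + mm1_response_ms(tpu_stage, arrivals_per_ms))
        if best is None or predicted < best[0]:
            best = (predicted, k, cores)
    return best


# Hypothetical per-partition profiles in ms; not measurements from the paper.
profiles = [
    (0.0, 10.0, 20.0),  # everything on the TPU, heavy swapping
    (8.0, 6.0, 2.0),    # split: some layers on the CPU, little swapping
    (40.0, 0.0, 0.0),   # everything on the CPU
]
predicted_ms, partition, cores = best_config(profiles, arrivals_per_ms=0.02, max_cores=2)
print(partition, cores, round(predicted_ms, 2))
```

Under the linear-speedup assumption, extra cores never hurt a single model, so the core dimension only becomes a real trade-off when several tenants share a fixed core budget, which this sketch does not model.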

If this is right

  • Swapping overhead drops enough to allow larger models or higher request rates on the same hardware.
  • Multi-tenant sharing of the Edge TPU becomes feasible without severe interference between models.
  • The online adjustments maintain low overhead while responding to changes in workload mix and arrival rate.
  • No changes to model architecture or compiler are required beyond the runtime partitioning logic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same queueing approach could be tested on other memory-constrained accelerators such as mobile NPUs to check whether similar latency gains appear.
  • Extending the model to include power draw would allow joint optimization of latency and energy on battery-powered devices.
  • In deployments with highly bursty traffic, the decision frequency might need tuning to keep overhead negligible.

Load-bearing premise

The analytic queueing model accurately captures the actual CPU and TPU service times plus swapping overheads for real hardware across changing partition choices, workload mixes, and request rates.

What would settle it

Run the system on Edge TPU hardware and compare measured latencies with the queueing model's predictions at multiple partition points and request rates. If the measurements deviate from the predictions by more than a small margin, or if the observed response times do not improve as predicted, the claim that the model enables minimal end-to-end latency would not hold.
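One way such a check could be scripted is sketched below, for a single request rate. The function names and the data are hypothetical; the second function asks a question that pure prediction error does not: how much measured latency is lost by trusting the model's choice instead of the configuration that was actually fastest.

```python
def max_relative_deviation(predicted, measured):
    """Largest |predicted - measured| / measured over all configurations."""
    return max(abs(predicted[c] - measured[c]) / measured[c] for c in measured)


def selection_regret_ms(predicted, measured):
    """Extra measured latency (ms) paid by using the model's preferred configuration."""
    model_choice = min(predicted, key=predicted.get)
    return measured[model_choice] - min(measured.values())


# Hypothetical latencies in ms per partition point; not data from the paper.
predicted = {"p0": 30.0, "p1": 14.0, "p2": 33.0}
measured = {"p0": 32.0, "p1": 15.0, "p2": 13.0}

print(max_relative_deviation(predicted, measured))
print(selection_regret_ms(predicted, measured))
```

A regret of zero at every tested rate would mean the model picks the measured-best partition even where its absolute predictions are off; a large regret would indicate that prediction errors change the decision.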

Figures

Figures reproduced from arXiv: 2602.17808 by Ayush Gupta, Balachandra Sunil, David Irwin, Nathan Ng, Prashanthi Kadambi, Prashant Shenoy, Walid A. Hanafy, Yogesh Simmhan.

Figure 1
Figure 1: Intra-model memory swapping overhead can contribute up to 62% of total TPU inference latency. Plot labels recovered from the page: x-axis "Workload Mix" (MobiNet:SqzNet 50:50, EffiNet:GPUNet 50:50, EffiNet:GPUNet 90:10); y-axis "Mean Latency (ms)"; legend "Service Time", "Swapping Overhead".
Figure 4
Figure 4: Overview of SwapLess architecture. Excerpt from the surrounding text (Section III, SwapLess Design): To address the above challenges, we propose SwapLess, a system designed to optimize multi-tenant inference on Edge TPU-equipped devices. This section provides an overview of the SwapLess design, introduces our proposed analytic queuing model that captures the effects of model partitioning, resource allocation, and memory swapping, and outlines its alg…
Figure 5
Figure 5: Model validation for single AI model deployments: …
Figure 6
Figure 6: Model validation for multi-tenant deployments: (a) Validation of the …
Figure 7
Figure 7: Latency comparison between SwapLess and baselines under different TPU utilization ρ. Excerpt from the surrounding text: … achieve similar performance as executing the workload on TPU does not incur swapping overhead. Conversely, for workloads whose memory footprint exceeds TPU capacity, SwapLess demonstrates significant advantages. Under low utilization (ρ = 0.2), SwapLess reduces mean latency by up to 56.2% in single-tenant settings and 68.0…
Figure 8
Figure 8: Performance under dynamic request rates for MnasNet …
Original abstract

IoT applications increasingly rely on on-device AI accelerators to ensure high performance, especially in low-connectivity and safety-critical scenarios. However, the limited on-chip memory of these accelerators forces inference runtimes to swap model segments between host and accelerator memory, incurring significant swapping overheads. While collaborative processing by partitioning model execution across CPU and accelerator resources can reduce accelerator memory pressure and execution overhead, naive partitioning may worsen end-to-end latency by either shifting excessive computation to the CPU or failing to sufficiently reduce swapping, a problem that is further exacerbated in multi-tenant and dynamic environments. To address these issues, we present SwapLess, a system for adaptive, multi-tenant TPU-CPU collaborative inference on memory-constrained Edge TPUs. SwapLess utilizes an analytic queueing model that captures partition-dependent CPU/TPU service times as well as inter- and intra-model swapping overheads across different workload mixes and request rates. Using this model, SwapLess continuously adjusts both the partition point and CPU core allocation online to minimize end-to-end response time with low decision overhead. An implementation on Edge TPU-equipped platforms demonstrates that SwapLess reduces mean latency by up to 63.8% for single-tenant workloads and up to 77.4% for multi-tenant workloads relative to the default Edge TPU compiler.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents SwapLess, a system for adaptive multi-tenant collaborative inference on memory-constrained Edge TPUs. It employs an analytic queueing model that accounts for partition-dependent CPU/TPU service times and inter/intra-model swapping overheads to dynamically select partition points and CPU core allocations online, with the goal of minimizing end-to-end response time. An implementation on Edge TPU platforms is reported to achieve mean latency reductions of up to 63.8% for single-tenant workloads and 77.4% for multi-tenant workloads relative to the default Edge TPU compiler.

Significance. If the queueing model remains accurate under real multi-tenant hardware contention, the approach could meaningfully improve latency for on-device AI in resource-constrained, multi-tenant IoT settings by reducing unnecessary swapping through principled partitioning. The concrete gains from a real implementation on Edge TPU hardware constitute a practical strength; the analytic (rather than purely empirical) nature of the decision model is also a positive attribute for reproducibility and low-overhead online use.

major comments (1)
  1. [Evaluation] Evaluation section: The central performance claims rest on the queueing model correctly predicting partition-dependent service times and swap costs for arbitrary workload mixes and arrival rates. However, the manuscript provides no quantitative validation of model accuracy (e.g., prediction error, R², or residual plots) against measured hardware behavior under concurrent tenants, nor does it report sensitivity to unmodeled effects such as memory-bus contention or TPU scheduling jitter. Without such evidence, it is unclear whether the reported latency reductions are robust or could degrade when the model drives suboptimal partitions.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from a brief statement of the workload generation methodology and the number of runs used to obtain the reported latency figures.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the evaluation of SwapLess. We address the major comment point by point below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: The central performance claims rest on the queueing model correctly predicting partition-dependent service times and swap costs for arbitrary workload mixes and arrival rates. However, the manuscript provides no quantitative validation of model accuracy (e.g., prediction error, R², or residual plots) against measured hardware behavior under concurrent tenants, nor does it report sensitivity to unmodeled effects such as memory-bus contention or TPU scheduling jitter. Without such evidence, it is unclear whether the reported latency reductions are robust or could degrade when the model drives suboptimal partitions.

    Authors: We agree that the manuscript would benefit from explicit quantitative validation of the queueing model's predictive accuracy. The reported latency reductions are based on direct hardware measurements using partitions selected by the model, and the consistent improvements across single- and multi-tenant workloads provide supporting evidence that the model captures the primary effects. However, to directly address this concern, we will add a new subsection in the Evaluation section of the revised manuscript. This subsection will include: (1) quantitative comparisons of predicted versus measured CPU/TPU service times and inter/intra-model swap costs for the evaluated workload mixes, reporting metrics such as mean absolute percentage error (MAPE) and R²; (2) residual analysis where feasible; and (3) sensitivity discussion based on additional experiments examining memory-bus contention and TPU scheduling jitter under varying tenant counts. These additions will clarify the model's accuracy and robustness without altering the core claims. revision: yes
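For reference, the two accuracy metrics named in the rebuttal can be computed as in the sketch below. It uses only the Python standard library, the function names are ours, and the sample values are hypothetical rather than taken from the paper.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent, relative to the measured values."""
    errors = [abs(m - p) / abs(m) for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_measured = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Hypothetical service times in ms; not data from the paper.
measured = [10.0, 20.0, 30.0]
predicted = [11.0, 18.0, 33.0]

print(mape(measured, predicted))
print(r_squared(measured, predicted))
```

MAPE weights every configuration equally regardless of its magnitude, while R² is dominated by the configurations with the largest latencies, so reporting both, as the authors propose, covers complementary failure modes.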

Circularity Check

0 steps flagged

No significant circularity in SwapLess derivation chain

full rationale

The paper constructs an analytic queueing model from first-principles descriptions of partition-dependent CPU/TPU service times and swapping overheads, then applies the model to drive online partition-point and core-allocation decisions. The reported latency reductions (up to 63.8% single-tenant, 77.4% multi-tenant) are obtained from direct hardware measurements on Edge TPU platforms against the default compiler baseline. Because the model is used only for runtime control and the performance claims rest on external empirical results rather than on any fitted parameter being renamed as a prediction or on self-referential definitions, no load-bearing step reduces to its own inputs by construction. The derivation remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view; the queueing model is presumed to rest on standard queueing assumptions plus measured service times, but no explicit free parameters, axioms, or invented entities are stated.
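For orientation only, one standard result of the kind the ledger presumes is the Pollaczek-Khinchine mean response time for an M/G/1 first-come-first-served queue. Whether the paper uses this exact form is not stated on this page; the symbols (arrival rate $\lambda$, service time $S$, utilization $\rho$, response time $T$) follow common textbook convention and are not taken from the paper.

```latex
\[
\mathbb{E}[T] = \mathbb{E}[S] + \frac{\lambda\,\mathbb{E}[S^2]}{2\,(1 - \rho)},
\qquad \rho = \lambda\,\mathbb{E}[S] < 1
\]
\[
\text{Exponential service: } \mathbb{E}[S^2] = 2\,\mathbb{E}[S]^2
\;\Longrightarrow\;
\mathbb{E}[T] = \frac{\mathbb{E}[S]}{1 - \rho}
\]
```

In a formula of this shape, the measured inputs are the first two moments of the service time and the arrival rate, which is consistent with the ledger's count of zero fitted free parameters.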



