
arxiv: 2603.09046 · v3 · pith:EMQ7WYLFnew · submitted 2026-03-10 · 💻 cs.CR · cs.LG · cs.OS

FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Pith reviewed 2026-05-15 14:18 UTC · model grok-4.3

classification 💻 cs.CR · cs.LG · cs.OS
keywords mobile LLM serving · TrustZone · secure inference · flexible isolation · on-device AI · TTFT optimization · multi-model scheduling

The pith

FlexServe allows ARM TrustZone to protect mobile LLM inference by switching memory and NPU modes on demand, cutting time to first token by over 10x versus rigid baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlexServe to reduce the slowdown that TrustZone protection imposes on device-side LLM inference. Standard TrustZone isolation of memory and the NPU creates high overhead when shielding model weights and user data from a compromised OS kernel. FlexServe adds a mechanism that lets both memory pages and the NPU flip rapidly between protected and unprotected states. It then layers an LLM-aware memory manager, a secure inference pipeline, and a multi-model scheduler on top of this flexibility. The resulting system targets the gap between the privacy promise of on-device LLMs and the performance cost that currently makes them impractical.

Core claim

FlexServe constructs Flexible Secure Memory and Flexible Secure NPU through a Flexible Resource Isolation mechanism that supports fast mode switches. Inside TrustZone's secure world it adds LLM-Aware Memory Management and a Secure Inference Pipeline for single-model acceleration, plus a Multi-Model Scheduler for agent-style workflows. Prototype measurements show these changes produce large reductions in inference latency compared with both basic and pipeline-enabled TrustZone strawman designs.
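One way to picture the role of a multi-model scheduler working against a bounded model cache is sketched below. This is a minimal illustration, not the paper's scheduler: the least-recently-used eviction policy, the `ModelCache` class, and the model names and sizes are all assumptions introduced here.

```python
from collections import OrderedDict


class ModelCache:
    """Hypothetical bounded cache of resident models (LRU eviction is an assumption)."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.resident = OrderedDict()  # model name -> size in MB, least recently used first

    def used_mb(self):
        return sum(self.resident.values())

    def load(self, name, size_mb):
        """Make `name` resident and return the names of any evicted models."""
        if size_mb > self.capacity_mb:
            raise ValueError("model does not fit in the cache at all")
        if name in self.resident:
            # Cache hit: mark as most recently used, nothing is evicted.
            self.resident.move_to_end(name)
            return []
        evicted = []
        # Evict least recently used models until the new one fits.
        while self.used_mb() + size_mb > self.capacity_mb:
            victim, _ = self.resident.popitem(last=False)
            evicted.append(victim)
        self.resident[name] = size_mb
        return evicted


cache = ModelCache(capacity_mb=4096)
cache.load("planner", 2000)
cache.load("coder", 1500)
cache.load("planner", 2000)            # hit, refreshes "planner"
evicted = cache.load("summarizer", 1000)
```

In an agent workflow that alternates between models, the order of requests determines which weights must be reloaded, which is why a scheduler that is aware of the workflow can matter for end-to-end latency.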

What carries the argument

The Flexible Resource Isolation mechanism, which switches memory pages and the NPU between unprotected and protected modes.
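The idea of switching pages between modes can be modeled as a small state machine. The sketch below is a toy model under stated assumptions: `FlexPagePool`, `protect`, and `release` are hypothetical names, and the real mechanism operates on hardware access controls that this code does not represent.

```python
from enum import Enum


class Mode(Enum):
    UNPROTECTED = "unprotected"  # normal world may access the page
    PROTECTED = "protected"      # only the secure world may access the page


class FlexPagePool:
    """Toy model of pages that flip between modes (names are hypothetical)."""

    def __init__(self, num_pages):
        self.modes = [Mode.UNPROTECTED] * num_pages
        self.switches = 0  # count of mode switches, useful for overhead accounting

    def protect(self, pages):
        for p in pages:
            if self.modes[p] is Mode.UNPROTECTED:
                self.modes[p] = Mode.PROTECTED
                self.switches += 1

    def release(self, pages):
        for p in pages:
            if self.modes[p] is Mode.PROTECTED:
                self.modes[p] = Mode.UNPROTECTED
                self.switches += 1

    def normal_world_can_access(self, page):
        return self.modes[page] is Mode.UNPROTECTED


pool = FlexPagePool(num_pages=8)
pool.protect([1, 2, 3])   # e.g. pages about to hold model weights
pool.release([2])         # page handed back to the normal world
```

Counting switches explicitly, as `switches` does here, is the kind of bookkeeping that makes it possible to relate mode-switch cost to inference time.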

Load-bearing premise

The overhead and security properties of rapid mode switches between protected and unprotected states remain stable when measured on production mobile hardware and under realistic kernel attacks.

What would settle it

If benchmarks on additional devices with live kernel exploits show that mode-switch latency or data exposure exceeds the reported gains, the central speedup and security claims would fail.
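The break-even point implied by this test can be written down directly. The sketch below is a back-of-the-envelope model, not a result from the paper: the function names and the example timings are hypothetical, and it assumes switch overhead simply adds to the FlexServe-side latency.

```python
def speedup_with_overhead(baseline_ms, flex_ms, extra_switch_ms):
    """Speedup over the baseline once extra switch overhead is added to the flexible design."""
    return baseline_ms / (flex_ms + extra_switch_ms)


def max_tolerable_overhead_ms(baseline_ms, flex_ms, min_speedup):
    """Largest extra overhead that still preserves at least `min_speedup`."""
    return baseline_ms / min_speedup - flex_ms


# Hypothetical numbers: a 1000 ms baseline TTFT and a 100 ms flexible-design TTFT (10x).
budget = max_tolerable_overhead_ms(baseline_ms=1000.0, flex_ms=100.0, min_speedup=5.0)
degraded = speedup_with_overhead(baseline_ms=1000.0, flex_ms=100.0, extra_switch_ms=budget)
```

With these illustrative inputs, 100 ms of additional switch overhead would halve the speedup from 10x to 5x, which shows how sensitive a headline ratio can be to a cost that is small in absolute terms.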

Figures

Figures reproduced from arXiv: 2603.09046 by Jinyu Gu, Lixiang Wang, Yinpeng Wu, Yitong Chen, Yubin Xia, Zhichao Hua.

Figure 1: Latency of allocating memory with different sizes
Figure 2: Breakdown of the TTFTs of normal-world inference
Figure 3: System overview of FlexServe: The Flex-Monitor constructs the Flex-Mem and Flex-NPU, and the FlexServe Framework provides a fast and secure LLM inference framework. model weights and input/output are protected. All normal-world applications are considered untrusted. FlexServe assumes the initial kernel code is benign and that secure boot protects its integrity. However, the kernel may contain bugs and cou…
Figure 4: Memory Protection of FlexServe.
Figure 5: TTFT with different input lengths and models.
Figure 6: Decode throughput with different models.
Figure 7: TTFT under varying background memory pressure.
Figure 8: TTFT of different model groups on real-world benchmarks with a 4GB model cache. UC: UltraChat, OA:
Figure 9: Response latency of real-world agent workflows.
Figure 11: Performance overhead to the SQLite.
Original abstract

Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone's secure world. The LLM-Aware Memory Management and Secure Inference Pipeline are introduced to accelerate inference. A Multi-Model Scheduler is proposed to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves an average $10.05\times$ speedup in Time to First Token (TTFT) compared to the strawman, and an average $2.44\times$ TTFT speedup compared to an optimized strawman with pipeline and secure NPU enabled. For multi-model agent workflows, the end-to-end speedup is up to $24.30\times$ and $4.05\times$ compared to the strawman and optimized strawman, respectively.
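An "average speedup" can be computed in more than one way, and the abstract does not say which is used. The sketch below shows the arithmetic and geometric means of per-workload TTFT ratios; the timing values are invented for illustration and are not taken from the paper.

```python
import statistics


def per_workload_speedups(baseline_ttft_ms, system_ttft_ms):
    """Ratio baseline/system for each workload; larger means the system is faster."""
    return [b / s for b, s in zip(baseline_ttft_ms, system_ttft_ms)]


# Hypothetical TTFT measurements (ms) for two workloads.
baseline = [100.0, 400.0]
system = [10.0, 100.0]

speedups = per_workload_speedups(baseline, system)   # [10.0, 4.0]
arithmetic_mean = statistics.mean(speedups)
geometric_mean = statistics.geometric_mean(speedups)
```

The two means diverge when speedups vary widely across workloads, so knowing which one a paper reports helps when comparing headline numbers.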

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents FlexServe, a secure LLM serving system for mobile devices that uses ARM TrustZone with a new Flexible Resource Isolation mechanism. This enables efficient dynamic switching of memory pages (Flex-Mem) and the NPU (Flex-NPU) between protected and unprotected modes. Building on these, the system adds LLM-Aware Memory Management, a Secure Inference Pipeline, and a Multi-Model Scheduler. A prototype implementation is evaluated against two TrustZone-based strawman designs, reporting average TTFT speedups of 10.05× versus the basic strawman and 2.44× versus an optimized strawman (with pipeline and secure NPU), plus end-to-end gains up to 24.30× and 4.05× for multi-model agent workflows.

Significance. If the performance claims are supported by complete characterization of mode-switching costs, this work would be significant for practical on-device LLM deployment. It directly addresses the tension between strong hardware isolation (TrustZone) and inference efficiency on resource-constrained mobile devices, offering a concrete prototype that demonstrates flexible isolation can deliver substantial speedups while maintaining security guarantees.

major comments (2)
  1. [Evaluation] Evaluation section: The headline TTFT claims (10.05× vs strawman, 2.44× vs optimized strawman) and multi-model gains (up to 24.30× / 4.05×) attribute improvements to Flexible Resource Isolation, yet no microbenchmark data, switch counts per inference step, or ablation isolating Flex-Mem/Flex-NPU switching latency from LLM-Aware Memory Management or the pipeline is provided. Without these, it is impossible to confirm that mode-switching overheads (e.g., TLB invalidation or NPU reconfiguration) are negligible relative to inference time.
  2. [§4.3] §4.3 (Secure Inference Pipeline): The integration of Flex-NPU mode switching with pipeline stages is described at a high level, but the paper does not quantify reconfiguration costs or their accumulation across token generation steps. This is load-bearing for the central claim that flexible isolation accelerates inference without eroding the reported speedups.
minor comments (2)
  1. [Abstract] The abstract and introduction refer to 'strawman designs' without a concise summary of their key limitations; adding one sentence would improve accessibility for readers.
  2. [Evaluation] Performance figures lack error bars, standard deviations, or details on workload selection and measurement methodology, which are standard for empirical systems papers.
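The accumulation concern in the major comments can be made concrete with a simple linear model. This is a sketch under stated assumptions: it treats every decode step as paying the same number of switches at a fixed latency, and all parameter names and values are hypothetical rather than measured.

```python
def cumulative_switch_overhead_ms(steps, switches_per_step, switch_latency_ms):
    """Total time spent on mode switches across all token-generation steps."""
    return steps * switches_per_step * switch_latency_ms


def overhead_fraction(steps, switches_per_step, switch_latency_ms, compute_ms_per_step):
    """Share of total generation time that goes to mode switching."""
    overhead = cumulative_switch_overhead_ms(steps, switches_per_step, switch_latency_ms)
    total = overhead + steps * compute_ms_per_step
    return overhead / total


# Hypothetical workload: 256 generated tokens, 2 switches per step at 0.05 ms each,
# and 50 ms of compute per step.
total_overhead = cumulative_switch_overhead_ms(256, 2, 0.05)
fraction = overhead_fraction(256, 2, 0.05, 50.0)
```

Under this model the fraction is independent of the number of steps, so the question the referee raises reduces to measuring switches per step and per-switch latency against per-step compute time.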

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the evaluation. We agree that additional microbenchmark data and quantifications will strengthen the paper and will revise the manuscript accordingly to address both major points.

  1. Referee: [Evaluation] Evaluation section: The headline TTFT claims (10.05× vs strawman, 2.44× vs optimized strawman) and multi-model gains (up to 24.30× / 4.05×) attribute improvements to Flexible Resource Isolation, yet no microbenchmark data, switch counts per inference step, or ablation isolating Flex-Mem/Flex-NPU switching latency from LLM-Aware Memory Management or the pipeline is provided. Without these, it is impossible to confirm that mode-switching overheads (e.g., TLB invalidation or NPU reconfiguration) are negligible relative to inference time.

    Authors: We agree that microbenchmark data would better isolate contributions and confirm negligible overheads. In the revised manuscript we will add: (1) microbenchmarks measuring Flex-Mem and Flex-NPU switching latencies including TLB invalidation and NPU reconfiguration costs; (2) the exact number of mode switches per inference step for representative workloads; and (3) an ablation study separating Flexible Resource Isolation from LLM-Aware Memory Management and the pipeline. These additions will directly show that switching costs remain negligible relative to inference time and support the reported speedups.
    revision: yes

  2. Referee: [§4.3] §4.3 (Secure Inference Pipeline): The integration of Flex-NPU mode switching with pipeline stages is described at a high level, but the paper does not quantify reconfiguration costs or their accumulation across token generation steps. This is load-bearing for the central claim that flexible isolation accelerates inference without eroding the reported speedups.

    Authors: We acknowledge the need for explicit quantification. In the revision we will expand §4.3 with measured Flex-NPU reconfiguration latencies and an analysis of their cumulative impact across successive token-generation steps. The new data will demonstrate that these costs do not erode the overall speedups delivered by flexible isolation, thereby reinforcing the central performance claim.
    revision: yes
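A microbenchmark of the kind promised in these responses typically times one operation many times and reports a mean with its spread. The harness below is a generic sketch: `microbenchmark` and `placeholder_mode_switch` are hypothetical names, and the placeholder stands in for a real mode-switch call that is not available here.

```python
import statistics
import time


def microbenchmark(fn, repeats=30, warmup=3):
    """Time `fn` repeatedly and return mean and sample standard deviation in ms."""
    for _ in range(warmup):
        fn()  # warm caches before measuring
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return {
        "n": len(samples_ms),
        "mean_ms": statistics.mean(samples_ms),
        "stdev_ms": statistics.stdev(samples_ms),
    }


def placeholder_mode_switch():
    # Stand-in workload; a real measurement would invoke the actual switch path.
    sum(range(1000))


result = microbenchmark(placeholder_mode_switch, repeats=10, warmup=2)
```

Reporting `stdev_ms` alongside `mean_ms` also covers the error-bar request raised in the referee's minor comments.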

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical prototype benchmarks


The paper describes a systems implementation (Flexible Resource Isolation, LLM-Aware Memory Management, Secure Inference Pipeline, Multi-Model Scheduler) and reports measured speedups from a prototype against strawman baselines. No equations, first-principles derivations, or predictions appear that reduce by construction to fitted inputs or self-referential definitions. Performance numbers are direct experimental results, not outputs of any model that was calibrated on the same quantities. Self-citations, if present, are not load-bearing for the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the introduction of two new mechanisms (Flex-Mem and Flex-NPU) without independent evidence beyond the prototype. It relies on the standard assumption that TrustZone provides effective isolation.

axioms (1)
  • domain assumption: ARM TrustZone provides hardware-based isolation between secure and normal worlds that protects against a compromised OS kernel.
    Invoked as the foundation for all secure inference claims.
invented entities (2)
  • Flex-Mem (no independent evidence)
    purpose: Flexible secure memory that can be efficiently switched between protected and unprotected modes.
    New mechanism introduced to reduce isolation overhead for LLM weights and data.
  • Flex-NPU (no independent evidence)
    purpose: Flexible secure NPU that can be efficiently switched between protected and unprotected modes.
    New mechanism introduced to reduce overhead for AI acceleration during secure inference.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SoK: Attack and Defense Landscape of Mobile On-device AI Systems

    cs.CR · 2026-07 · unverdicted · novelty 7.0

    This SoK paper introduces the first systematic framework covering security pillars, attack landscape, and defense landscape for mobile on-device AI systems while identifying research gaps.