FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Jinyu Gu; Lixiang Wang; Yinpeng Wu; Yitong Chen; Yubin Xia; Zhichao Hua

arxiv: 2603.09046 · v3 · pith:EMQ7WYLFnew · submitted 2026-03-10 · 💻 cs.CR · cs.LG · cs.OS

Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu, Zhichao Hua, Yubin Xia

Pith reviewed 2026-05-15 14:18 UTC · model grok-4.3

classification 💻 cs.CR · cs.LG · cs.OS

keywords: mobile LLM serving, TrustZone, secure inference, flexible isolation, on-device AI, TTFT optimization, multi-model scheduling

The pith

FlexServe lets ARM TrustZone protect mobile LLM inference by switching memory and NPU modes on demand, cutting time to first token by about 10x on average versus a basic TrustZone strawman and about 2.4x versus an optimized one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlexServe to reduce the slowdown that TrustZone protection imposes on device-side LLM inference. Standard TrustZone isolation of memory and the NPU creates high overhead when shielding model weights and user data from a compromised OS kernel. FlexServe adds a mechanism that lets both memory pages and the NPU flip rapidly between protected and unprotected states. It then layers an LLM-aware memory manager, a secure inference pipeline, and a multi-model scheduler on top of this flexibility. The resulting system targets the gap between the privacy promise of on-device LLMs and the performance cost that currently makes them impractical.

Core claim

FlexServe constructs Flexible Secure Memory and Flexible Secure NPU through a Flexible Resource Isolation mechanism that supports fast mode switches. Inside TrustZone's secure world it adds LLM-Aware Memory Management and a Secure Inference Pipeline for single-model acceleration, plus a Multi-Model Scheduler for agent-style workflows. Prototype measurements show an average 10.05× TTFT speedup over a basic TrustZone strawman and an average 2.44× speedup over an optimized strawman with the pipeline and secure NPU enabled.

What carries the argument

Flexible Resource Isolation mechanism that switches memory pages and the NPU between unprotected and protected modes
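
To make the idea of on-demand mode switching concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the `Mode` enum, the `FlexPage` class, and the rule that the normal world can touch a page only while it is unprotected are illustrative names invented here, not the paper's Flex-Monitor interface.

```python
from enum import Enum


class Mode(Enum):
    UNPROTECTED = "unprotected"  # normal world may access the resource
    PROTECTED = "protected"      # only the secure world may access it


class FlexPage:
    """Toy stand-in (hypothetical) for one memory page whose isolation mode can flip."""

    def __init__(self) -> None:
        self.mode = Mode.UNPROTECTED
        self.switches = 0  # number of mode changes, useful for cost accounting

    def set_mode(self, mode: Mode) -> None:
        # Count a switch only when the mode actually changes.
        if mode is not self.mode:
            self.mode = mode
            self.switches += 1

    def normal_world_access(self) -> bool:
        # In this toy model an untrusted accessor succeeds only on unprotected pages.
        return self.mode is Mode.UNPROTECTED


page = FlexPage()
page.set_mode(Mode.PROTECTED)      # assumed use: protect the page while it holds secrets
blocked = not page.normal_world_access()
page.set_mode(Mode.UNPROTECTED)    # assumed use: hand the page back afterwards
print(blocked, page.switches)
```

The `switches` counter is included because the number of mode changes per request is exactly the quantity the review later asks the authors to report.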

Load-bearing premise

The overhead and security properties of rapid mode switches between protected and unprotected states remain stable when measured on production mobile hardware and under realistic kernel attacks.

What would settle it

If benchmarks on additional devices with live kernel exploits show that mode-switch latency or data exposure exceeds the reported gains, the central speedup and security claims would fail.

Figures

Figures reproduced from arXiv: 2603.09046 by Jinyu Gu, Lixiang Wang, Yinpeng Wu, Yitong Chen, Yubin Xia, Zhichao Hua.

**Figure 2.** Breakdown of the TTFTs of normal-world inference.

**Figure 3.** System overview of FlexServe: The Flex-Monitor constructs the Flex-Mem and Flex-NPU, and the FlexServe Framework provides a fast and secure LLM inference framework.

**Figure 4.** Memory Protection of FlexServe.

**Figure 5.** TTFT with different input lengths and models.

**Figure 6.** Decode throughput with different models.

**Figure 7.** TTFT under varying background memory pressure.

**Figure 8.** TTFT of different model groups on real-world benchmarks with a 4GB model cache. UC: UltraChat, OA: …

**Figure 9.** Response latency of real-world agent workflows.

**Figure 11.** Performance overhead to the SQLite.

Original abstract

Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone's secure world. The LLM-Aware Memory Management and Secure Inference Pipeline are introduced to accelerate inference. A Multi-Model Scheduler is proposed to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves an average $10.05\times$ speedup in Time to First Token (TTFT) compared to the strawman, and an average $2.44\times$ TTFT speedup compared to an optimized strawman with pipeline and secure NPU enabled. For multi-model agent workflows, the end-to-end speedup is up to $24.30\times$ and $4.05\times$ compared to the strawman and optimized strawman, respectively.
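
The two pairs of speedup figures in the abstract can be related with a line of arithmetic. The sketch below is a back-of-the-envelope reading, not a result from the paper: it assumes each pair of figures was measured on the same workloads, which the abstract does not state, and averages of ratios do not divide exactly. The variable names are ours.

```python
# Speedup figures quoted in the abstract.
ttft_vs_basic = 10.05       # average TTFT speedup over the basic strawman
ttft_vs_optimized = 2.44    # average TTFT speedup over the optimized strawman
e2e_vs_basic = 24.30        # best-case end-to-end speedup, multi-model workflows
e2e_vs_optimized = 4.05     # best-case end-to-end speedup vs the optimized strawman

# Assumption: both figures in a pair come from the same workloads, so dividing
# them approximates how much the optimized strawman improves on the basic one.
implied_ttft_gain = ttft_vs_basic / ttft_vs_optimized
implied_e2e_gain = e2e_vs_basic / e2e_vs_optimized

print(f"{implied_ttft_gain:.2f}x, {implied_e2e_gain:.2f}x")
```

Under that assumption, the pipeline and secure NPU alone would account for roughly a 4.1x TTFT gain over the basic strawman, leaving about 2.4x attributable to the FlexServe-specific mechanisms.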

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FlexServe adds flexible TrustZone mode switching for mobile LLM serving and shows solid prototype speedups, but the switching overheads stay unmeasured.

The paper's main contribution is a pair of mechanisms, Flex-Mem and Flex-NPU, that let memory pages and the NPU flip between protected and unprotected modes without the usual heavy TrustZone tax. The authors pair this with LLM-aware memory management, a secure inference pipeline, and a multi-model scheduler. The prototype then runs on real hardware and reports concrete numbers: a TTFT speedup of roughly 10x over a basic strawman and 2.4x over an optimized one, with bigger gains in multi-model agent flows. That is useful evidence for anyone who has tried to run protected inference on phones and hit the isolation wall.

The implementation looks honest; the authors actually built it and compared against two TrustZone baselines rather than just claiming theoretical wins. The multi-model scheduler is a practical addition that addresses a real workload pattern.

The soft spot is exactly what the stress test flags. The speedups rest on the assumption that mode switches are cheap, yet the paper gives no microbenchmark for switch latency, no count of switches per token, and no ablation that isolates the switching cost from the other optimizations. In longer multi-model runs even small per-switch costs could shrink the advantage. The abstract also skips error bars and workload details, so the numbers are harder to judge without the full experimental section.

This is for systems people who care about on-device security and performance. It has a working prototype and addresses a clear pain point, so it deserves a serious referee rather than a desk reject. The reviewers can push on the missing overhead data and the experimental rigor, but the core idea is worth the time.

Referee Report

2 major / 2 minor

Summary. The paper presents FlexServe, a secure LLM serving system for mobile devices that uses ARM TrustZone with a new Flexible Resource Isolation mechanism. This enables efficient dynamic switching of memory pages (Flex-Mem) and the NPU (Flex-NPU) between protected and unprotected modes. Building on these, the system adds LLM-Aware Memory Management, a Secure Inference Pipeline, and a Multi-Model Scheduler. A prototype implementation is evaluated against two TrustZone-based strawman designs, reporting average TTFT speedups of 10.05× versus the basic strawman and 2.44× versus an optimized strawman (with pipeline and secure NPU), plus end-to-end gains up to 24.30× and 4.05× for multi-model agent workflows.

Significance. If the performance claims are supported by complete characterization of mode-switching costs, this work would be significant for practical on-device LLM deployment. It directly addresses the tension between strong hardware isolation (TrustZone) and inference efficiency on resource-constrained mobile devices, offering a concrete prototype that demonstrates flexible isolation can deliver substantial speedups while maintaining security guarantees.

major comments (2)

[Evaluation] Evaluation section: The headline TTFT claims (10.05× vs strawman, 2.44× vs optimized strawman) and multi-model gains (up to 24.30× / 4.05×) attribute improvements to Flexible Resource Isolation, yet no microbenchmark data, switch counts per inference step, or ablation isolating Flex-Mem/Flex-NPU switching latency from LLM-Aware Memory Management or the pipeline is provided. Without these, it is impossible to confirm that mode-switching overheads (e.g., TLB invalidation or NPU reconfiguration) are negligible relative to inference time.
[§4.3] §4.3 (Secure Inference Pipeline): The integration of Flex-NPU mode switching with pipeline stages is described at a high level, but the paper does not quantify reconfiguration costs or their accumulation across token generation steps. This is load-bearing for the central claim that flexible isolation accelerates inference without eroding the reported speedups.
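
The concern in both major comments can be stated as a simple cost model: if each mode switch costs a fixed amount, the effective speedup is the baseline time divided by compute time plus accumulated switch overhead. The sketch below illustrates that model only. The function name and every number in it are hypothetical and chosen for round arithmetic; none are taken from the paper.

```python
def effective_speedup(baseline_ms: float, flex_compute_ms: float,
                      switches: int, switch_cost_ms: float) -> float:
    """Speedup over the baseline once mode-switch overhead is charged."""
    total_flex_ms = flex_compute_ms + switches * switch_cost_ms
    return baseline_ms / total_flex_ms


# Hypothetical numbers chosen only for illustration, not taken from the paper.
BASELINE_MS = 1000.0      # baseline time for one request
FLEX_COMPUTE_MS = 100.0   # flexible-isolation time excluding switch overhead

for switches in (0, 100, 1000):
    s = effective_speedup(BASELINE_MS, FLEX_COMPUTE_MS, switches, switch_cost_ms=0.1)
    print(f"{switches:5d} switches -> {s:.2f}x")
```

With these illustrative inputs, a nominal 10x speedup halves once switch overhead equals the compute time, which is why switch counts per inference step and per-switch latency both need to be reported.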

minor comments (2)

[Abstract] The abstract and introduction refer to 'strawman designs' without a concise summary of their key limitations; adding one sentence would improve accessibility for readers.
[Evaluation] Performance figures lack error bars, standard deviations, or details on workload selection and measurement methodology, which are standard for empirical systems papers.
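
As a reminder of what the second minor comment asks for, the sketch below computes a speedup with a first-order uncertainty estimate from repeated runs. The measurements are made-up placeholders, and the error-propagation formula assumes the two sets of runs are independent and of equal size; it is one common choice, not the methodology of the paper.

```python
import statistics

# Hypothetical per-run TTFT measurements in milliseconds (illustrative only).
baseline_runs = [980.0, 1010.0, 1005.0, 995.0, 1010.0]
system_runs = [98.0, 103.0, 100.0, 99.0, 100.0]


def summarize(samples: list[float]) -> tuple[float, float]:
    """Return the sample mean and sample standard deviation."""
    return statistics.mean(samples), statistics.stdev(samples)


base_mean, base_sd = summarize(baseline_runs)
sys_mean, sys_sd = summarize(system_runs)
speedup = base_mean / sys_mean

# First-order error propagation for a ratio of independent means: the relative
# error of the ratio is the root-sum-square of the relative standard errors.
n = len(baseline_runs)  # both lists have the same length in this sketch
rel_err = ((base_sd / base_mean) ** 2 / n + (sys_sd / sys_mean) ** 2 / n) ** 0.5
print(f"speedup {speedup:.2f}x +/- {speedup * rel_err:.2f}")
```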

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the evaluation. We agree that additional microbenchmark data and quantifications will strengthen the paper and will revise the manuscript accordingly to address both major points.

Referee: [Evaluation] Evaluation section: The headline TTFT claims (10.05× vs strawman, 2.44× vs optimized strawman) and multi-model gains (up to 24.30× / 4.05×) attribute improvements to Flexible Resource Isolation, yet no microbenchmark data, switch counts per inference step, or ablation isolating Flex-Mem/Flex-NPU switching latency from LLM-Aware Memory Management or the pipeline is provided. Without these, it is impossible to confirm that mode-switching overheads (e.g., TLB invalidation or NPU reconfiguration) are negligible relative to inference time.

Authors: We agree that microbenchmark data would better isolate contributions and confirm negligible overheads. In the revised manuscript we will add: (1) microbenchmarks measuring Flex-Mem and Flex-NPU switching latencies including TLB invalidation and NPU reconfiguration costs; (2) the exact number of mode switches per inference step for representative workloads; and (3) an ablation study separating Flexible Resource Isolation from LLM-Aware Memory Management and the pipeline. These additions will directly show that switching costs remain negligible relative to inference time and support the reported speedups.
Revision: yes

Referee: [§4.3] §4.3 (Secure Inference Pipeline): The integration of Flex-NPU mode switching with pipeline stages is described at a high level, but the paper does not quantify reconfiguration costs or their accumulation across token generation steps. This is load-bearing for the central claim that flexible isolation accelerates inference without eroding the reported speedups.

Authors: We acknowledge the need for explicit quantification. In the revision we will expand §4.3 with measured Flex-NPU reconfiguration latencies and an analysis of their cumulative impact across successive token-generation steps. The new data will demonstrate that these costs do not erode the overall speedups delivered by flexible isolation, thereby reinforcing the central performance claim.
Revision: yes
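
The microbenchmarks promised in the rebuttal would have roughly the following shape: time one operation many times and report the median and tail latency rather than a single average. This is a generic host-side sketch, not the authors' harness. `fake_mode_switch` is a placeholder workload invented here, and a real measurement would call the actual mode-switch path on the device.

```python
import statistics
import time
from typing import Callable


def microbenchmark(op: Callable[[], None], iterations: int = 1000) -> dict[str, float]:
    """Time `op` repeatedly and report latency statistics in microseconds."""
    samples_us = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        op()
        samples_us.append((time.perf_counter_ns() - start) / 1000.0)
    samples_us.sort()
    return {
        "median_us": statistics.median(samples_us),
        "p99_us": samples_us[min(len(samples_us) - 1, int(0.99 * len(samples_us)))],
        "max_us": samples_us[-1],
    }


def fake_mode_switch() -> None:
    # Placeholder workload (hypothetical): a real harness would invoke the
    # mode-switch call here instead of this arithmetic.
    sum(range(100))


stats = microbenchmark(fake_mode_switch, iterations=500)
print(stats)
```

Reporting the tail alongside the median matters here because occasional slow switches would accumulate over long token-generation runs even when the typical switch is cheap.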

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical prototype benchmarks

The paper describes a systems implementation (Flexible Resource Isolation, LLM-Aware Memory Management, Secure Inference Pipeline, Multi-Model Scheduler) and reports measured speedups from a prototype against strawman baselines. No equations, first-principles derivations, or predictions appear that reduce by construction to fitted inputs or self-referential definitions. Performance numbers are direct experimental results, not outputs of any model that was calibrated on the same quantities. Self-citations, if present, are not load-bearing for the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the introduction of two new mechanisms (Flex-Mem and Flex-NPU) without independent evidence beyond the prototype. It relies on the standard assumption that TrustZone provides effective isolation.

axioms (1)

Domain assumption: ARM TrustZone provides hardware-based isolation between secure and normal worlds that protects against a compromised OS kernel.
Invoked as the foundation for all secure inference claims.

invented entities (2)

Flex-Mem (no independent evidence)
Purpose: Flexible secure memory that can be efficiently switched between protected and unprotected modes.
New mechanism introduced to reduce isolation overhead for LLM weights and data.

Flex-NPU (no independent evidence)
Purpose: Flexible secure NPU that can be efficiently switched between protected and unprotected modes.
New mechanism introduced to reduce overhead for AI acceleration during secure inference.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SoK: Attack and Defense Landscape of Mobile On-device AI Systems
cs.CR · 2026-07 · unverdicted · novelty 7.0

This SoK paper introduces the first systematic framework covering security pillars, attack landscape, and defense landscape for mobile on-device AI systems while identifying research gaps.