FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Jinyu Gu; Lixiang Wang; Yinpeng Wu; Yitong Chen; Yubin Xia; Zhichao Hua

arXiv: 2606.23370 · v2 · pith:5PP4RSV5new · submitted 2026-06-22 · 💻 cs.CR · cs.LG · cs.OS

Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu, Zhichao Hua, Yubin Xia

Pith reviewed 2026-07-03 23:09 UTC · model grok-4.3

classification 💻 cs.CR cs.LG cs.OS

keywords LLM inference, mobile devices, TrustZone, resource isolation, secure serving, on-device AI, ARM security, memory management

The pith

FlexServe decouples access from management permissions in TrustZone so the normal OS can allocate secure memory and NPU for LLMs without reading their contents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that TrustZone protections for on-device LLMs suffer from inflexible isolation and slow resource management, creating high overhead for both secure inference and normal apps. FlexServe introduces Recallable Resource Isolation to create Flex-Mem and Flex-NPU that only the secure world can access yet the normal-world OS can still allocate and reclaim efficiently. This decoupling enables cooperative management between the two worlds through a new framework for LLM inference. If the approach holds, private model weights and user data stay protected from a compromised kernel while time-to-first-token latency drops sharply. The reported results show average speedups of 10.05X versus a basic strawman and 2.44X versus an optimized one.

Core claim

FlexServe presents Recallable Resource Isolation to construct Recallable Secure Memory (Flex-Mem) and Recallable Secure NPU (Flex-NPU). These resources remain accessible only to the secure world while the normal-world OS retains normal allocation and reclamation rights. The FlexServe Framework then coordinates with the normal OS for cooperative secure memory management during LLM inference, protecting both model weights and user data from kernel compromise.

What carries the argument

Recallable Resource Isolation, which separates access permission from management permission so the normal OS can handle allocation and reclamation without gaining read or write access.
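
As an illustration of the separation described above, the following Python sketch encodes the two permission classes as a small lookup table. It is a toy policy model under stated assumptions, not the paper's mechanism: the names `World`, `Op`, `PERMITTED`, and `is_permitted` are hypothetical, and because the summary does not say whether the secure world also holds management rights, the table leaves that case out.

```python
from enum import Enum, auto


class World(Enum):
    NORMAL = auto()
    SECURE = auto()


class Op(Enum):
    READ = auto()
    WRITE = auto()
    ALLOCATE = auto()
    RECLAIM = auto()


# Hypothetical policy table for a recallable secure resource.
# Access rights (READ, WRITE) belong to the secure world only;
# management rights (ALLOCATE, RECLAIM) stay with the normal world.
# Secure-world management rights are not modeled (assumption).
PERMITTED = {
    (World.SECURE, Op.READ),
    (World.SECURE, Op.WRITE),
    (World.NORMAL, Op.ALLOCATE),
    (World.NORMAL, Op.RECLAIM),
}


def is_permitted(world: World, op: Op) -> bool:
    """Return True if `world` may perform `op` under the toy policy."""
    return (world, op) in PERMITTED


if __name__ == "__main__":
    for world in World:
        for op in Op:
            print(f"{world.name:6} {op.name:8} -> {is_permitted(world, op)}")
```

The point of the table form is that access and management are looked up independently, so granting the normal world `RECLAIM` says nothing about `READ`.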

If this is right

Secure LLM inference runs with substantially lower time-to-first-token latency than prior TrustZone designs.
The normal-world OS can manage secure resources at normal speed without impacting its own applications.
Model weights and user data remain protected from kernel-level attackers throughout inference.
Cooperative memory management between secure and normal worlds becomes feasible for other resource-heavy secure tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling pattern could apply to other on-device accelerators or memory pools beyond NPU.
It may allow secure-world applications to scale to larger models without requiring dedicated hardware reservations.
Similar permission separation might reduce overhead in other ARM TrustZone use cases such as secure storage or authentication.
Adoption would shift the design focus from static isolation to dynamic, revocable resource control in mobile security.

Load-bearing premise

Leaving allocation and reclamation of secure resources with the normal-world OS stays safe even when that OS is compromised, because management rights give it no ability to read or tamper with the resources' contents.

What would settle it

A test in which a compromised normal-world kernel reads the contents of Flex-Mem or issues unauthorized commands to Flex-NPU after the isolation mechanism is applied.

Figures

Figures reproduced from arXiv: 2606.23370 by Jinyu Gu, Lixiang Wang, Yinpeng Wu, Yitong Chen, Yubin Xia, Zhichao Hua.

**Figure 2.** Breakdown of the TTFTs of the normal-world inference and the TrustZone-based strawman (Llama3.1 8B with a 128-token prompt). Adjacent text: … NPU. We compare FlexServe against two TrustZone-based strawman designs. The results show that FlexServe achieves an average 10.05× TTFT speedup over the strawman, and an average 2.44× TTFT speedup over an optimized strawman with pipelining and the secure NPU enabled. For agent workf…

**Figure 3.** System overview of FlexServe: The Flex-Monitor constructs the Flex-Mem and Flex-NPU, and the FlexServe Framework provides a fast and secure LLM inference framework. Adjacent text: … (VA) to the intermediate physical address (IPA) for each VM. The Stage-2 Page Table (S2PT), controlled by the hypervisor, then translates the IPA to the physical address (PA) for each VM. The System MMU (SMMU) is introduced to enforce access c…

**Figure 4.** Memory Protection of FlexServe. Adjacent text: … allocation, the secure-world Trusted OS maps the Flex-Mem pages into the FlexServe Framework’s address space. Details about the FlexServe Framework are provided in Section 5. If a Flex-Mem page is reclaimed, the Flex-Monitor remaps it in the normal-world S2PT and returns ownership to the normal-world OS. DMA Protection: Direct Memory Access (DMA) may be abused to access Flex…

**Figure 5.** Prefill time (TTFT) without cache. Recoverable plot labels: models Llama3.2 3B, Llama3.1 8B, Qwen3 0.6B, Qwen3 1.7B; systems NW-Base, FlexServe, Strawman-OPT, Strawman; axis label Tokens Per Second (0 to 25).

**Figure 6.** Decode throughput without cache. Adjacent text: 7.3 Performance without Cache. To answer question-2, we evaluate the prefill and decode performance of FlexServe. The model weights and KV caches are disabled to reveal the cold-start performance. Prefill Performance: We evaluate models ranging from 1.7B to 8B, all quantized to INT8 precision

**Figure 7.** Prefill time (TTFT) under varying background memory pressure, without cache. Recoverable plot labels: systems FlexServe, Strawman-OPT, Strawman; x-axis UC, OA, DD, AD; y-axis TTFT (s); panels (a) Qwen3 1.7B + Llama3.1 8B, (b) Qwen3 0.6B + Qwen3 1.7B, (c) Qwen3 0.6B + Qwen3 8B, (d) Qwen3 1.7B + Llama3.2 3B.

**Figure 8.** Prefill time (TTFT) of different model groups and different benchmarks, with 4GB cache. UC: UltraChat, OA: OpenAssistant, DD: Dolly Dataset, AD: Alpaca Data.

**Figure 9.** Response latency of real-world agent workflows. Adjacent text: … FlexServe’s speedup. Nevertheless, FlexServe still outperforms the Strawman by 14.15× and Strawman-OPT by 2.94× on average. 7.5 Overhead to Normal-World Applications. To answer question-4, we evaluate how FlexServe affects the performance of normal-world applications. Influence of Memory Allocation: We run SQLite [27] as the normal-world application, continuo…

**Figure 10.** Normalized performance of normal-world applications (bars, higher is better) and prefill time of secure inference (lines, lower is better). Adjacent text: … 2.46%. Moreover, the on-demand protection eliminates this overhead entirely when no secure inference task is active. End-to-End Application Overhead: To demonstrate the benefits of FlexServe’s cooperative memory management and LLM-aware memory reclamation, we evaluate…

**Figure 11.** Ablation of prefill optimization. Adjacent text: Flex-NPU and Flex-Mem contribute the largest performance gains, reducing TTFT by 49.01% and 37.89% on average, respectively. Even after Flex-NPU, Flex-Mem, and pipeline optimizations are enabled, FlexServe’s cache optimization further reduces TTFT by 36.59% on average. This confirms the importance of FlexServe’s memory management mechanism. Although our evaluation pla…
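
The ablation figures quoted with Figure 11 are percentage reductions in TTFT, while the headline results are speedup factors. The sketch below converts one into the other using the identity speedup = 1 / (1 − reduction), which follows from defining speedup as old TTFT divided by new TTFT. The function name `reduction_to_speedup` is illustrative, and each factor applies only to its own ablation step: the excerpt does not say that the steps share a baseline, so the factors should not be multiplied together.

```python
def reduction_to_speedup(reduction_pct: float) -> float:
    """Convert a percentage latency reduction into a speedup factor.

    If the new latency is (1 - r) times the old latency, then
    speedup = old / new = 1 / (1 - r).
    """
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Average TTFT reductions quoted in the Figure 11 excerpt.
REPORTED_REDUCTIONS = {
    "Flex-NPU": 49.01,
    "Flex-Mem": 37.89,
    "cache optimization": 36.59,
}

if __name__ == "__main__":
    for name, pct in REPORTED_REDUCTIONS.items():
        factor = reduction_to_speedup(pct)
        print(f"{name}: {pct:.2f}% lower TTFT is about {factor:.2f}x")
```

Under this conversion the three quoted reductions correspond to roughly 1.96×, 1.61×, and 1.58× for their respective steps.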

Original abstract

Device-side Large Language Models (LLMs) have grown explosively, offering stronger privacy and higher availability than their cloud-side counterparts. During LLM inference, both the model weights and the user data are valuable, and attackers may compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead to both the secure inference and the normal applications, due to two challenges: the inflexible resource isolation and the inefficient secure resource management. To address these challenges, this paper presents FlexServe, a fast and secure LLM inference system for mobile devices. The key idea is to decouple the access permission from the management permission of secure resources, so that the normal-world OS cannot access them but can still manage them as usual. First, FlexServe introduces a Recallable Resource Isolation mechanism to construct Recallable Secure Memory (Flex-Mem) and a Recallable Secure NPU (Flex-NPU). They can only be accessed by the secure world, but can be efficiently allocated and reclaimed by the normal-world OS. Based on them, FlexServe further introduces a FlexServe Framework to run secure LLM inference in the secure world. It works together with the normal-world OS to perform cooperative secure memory management. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves average TTFT speedups of 10.05X over the strawman and 2.44X over an optimized strawman.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes letting the normal OS manage but not access TrustZone memory and NPU for faster secure LLM inference on mobiles, but the abstract supplies no enforcement mechanism or security argument for that separation.

The main takeaway is a proposed recallable isolation scheme that decouples access permission from management permission in TrustZone. This lets the normal-world OS allocate and reclaim Flex-Mem and Flex-NPU while the secure world alone can touch the contents, with claimed 10.05X TTFT gains over a basic strawman and 2.44X over an optimized one for on-device LLM inference.

What is new is the specific application of this decoupling to LLM serving, plus the cooperative framework that pairs the secure inference with normal-world management. The paper correctly flags the usual TrustZone overheads for both secure and normal workloads as the core practical barrier.

It does a reasonable job framing a real deployment problem that matters for private mobile AI. The direction is practical and the performance target is concrete.

The soft spot is that the abstract states the separation exists but gives no hardware primitive, TrustZone configuration, or argument showing why a compromised normal kernel cannot still read or tamper during reclamation. That assumption is load-bearing; if it does not hold, the security premise and the speedups both collapse. No implementation sketch, workload description, or error bars are supplied either, so the numbers cannot be assessed.

This is for researchers working on TrustZone extensions or on-device secure inference. A reader already thinking about resource isolation on ARM might pick up the recallable concept as something to explore. The abstract alone is too thin for citation or strong conclusions.

I would send it to peer review because the problem is current and the idea targets a genuine gap, even though the key enforcement details are missing from what is available here.

Referee Report

1 major / 0 minor

Summary. The manuscript claims to introduce FlexServe, a system for fast and secure LLM inference on mobile devices protected by ARM TrustZone. It proposes Recallable Resource Isolation to construct Flex-Mem and Flex-NPU, which decouple access permission (restricted to the secure world) from management permission (allowing the normal-world OS to allocate and reclaim them). This is said to overcome inflexible isolation and inefficient management in prior TrustZone uses for LLMs. A prototype is reported to achieve average TTFT speedups of 10.05X versus a basic strawman and 2.44X versus an optimized strawman.

Significance. If the claimed separation between access and management can be enforced without introducing new attack surfaces, the approach could meaningfully reduce the overhead of hardware-isolated LLM inference on resource-constrained mobile devices, improving the practicality of on-device models under a compromised-OS threat model. The performance numbers, if reproducible, would indicate a substantial improvement over existing TrustZone-based designs for this workload.

major comments (1)

[Abstract] The central security claim, that Flex-Mem and Flex-NPU 'can only be accessed by the secure world' while remaining 'efficiently allocated and reclaimed by the normal-world OS', is stated without any description of the underlying hardware primitive, TrustZone memory-region configuration, NPU access-control mechanism, or secure-monitor call sequence that would prevent a compromised normal-world kernel from reading or tampering with the resources during reclamation. This mechanism is load-bearing for both the security argument and the relevance of the reported TTFT speedups.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The point about the abstract's security claim is well-taken, and we address it directly below.

Referee: [Abstract] The central security claim, that Flex-Mem and Flex-NPU 'can only be accessed by the secure world' while remaining 'efficiently allocated and reclaimed by the normal-world OS', is stated without any description of the underlying hardware primitive, TrustZone memory-region configuration, NPU access-control mechanism, or secure-monitor call sequence that would prevent a compromised normal-world kernel from reading or tampering with the resources during reclamation. This mechanism is load-bearing for both the security argument and the relevance of the reported TTFT speedups.

Authors: We agree that the abstract, as a high-level summary, does not include the low-level enforcement details. The manuscript body provides these descriptions (TrustZone memory region attributes set to secure-world only, NPU access control via secure-world ownership, and SMC sequences for allocation/reclamation that revoke normal-world access). However, to make the central claim self-contained in the abstract, we will revise it to briefly reference the hardware primitives and SMC-based reclamation protocol. This addresses the concern without altering the reported performance results. revision: yes
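
The allocation and reclamation sequence that the rebuttal refers to can be pictured with a small state model. The Python sketch below is an assumption-laden toy, not the authors' implementation: the class `FlexMonitorModel` and its methods `protect`, `reclaim`, and `normal_world_read` are hypothetical names, physical pages are plain integers, and the normal-world Stage-2 page table is reduced to a set of mapped pages. Removing the page from that set on allocation is inferred from the Figure 4 excerpt, which says a reclaimed page is remapped in the normal-world S2PT. Whether and when page contents are cleared before reclamation is not covered by the available text and is not modeled.

```python
class AccessFault(Exception):
    """Raised when the modeled normal world touches an unmapped page."""


class FlexMonitorModel:
    """Toy model of page ownership moving between the two worlds."""

    def __init__(self, num_pages: int) -> None:
        # Every page starts mapped in the normal-world S2PT.
        self.normal_s2pt = set(range(num_pages))
        self.secure_pages = set()

    def protect(self, page: int) -> None:
        """Hypothetical allocation step: hand a page to the secure world."""
        if page not in self.normal_s2pt:
            raise ValueError(f"page {page} is not owned by the normal world")
        self.normal_s2pt.remove(page)  # normal world loses access
        self.secure_pages.add(page)

    def reclaim(self, page: int) -> None:
        """Hypothetical reclamation step: return a page to the normal world."""
        if page not in self.secure_pages:
            raise ValueError(f"page {page} is not a secure page")
        self.secure_pages.remove(page)
        self.normal_s2pt.add(page)  # remapped in the normal-world S2PT

    def normal_world_read(self, page: int) -> str:
        """Model a normal-world access; it faults if the page is unmapped."""
        if page not in self.normal_s2pt:
            raise AccessFault(f"normal world cannot access page {page}")
        return f"contents of page {page}"


if __name__ == "__main__":
    monitor = FlexMonitorModel(num_pages=4)
    monitor.protect(2)
    try:
        monitor.normal_world_read(2)
    except AccessFault as fault:
        print("blocked:", fault)
    monitor.reclaim(2)
    print(monitor.normal_world_read(2))
```

In this model the normal world decides which page moves and when, yet every read it attempts on a protected page faults, which is the property the referee asks the authors to justify for the real hardware path.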

Circularity Check

0 steps flagged

No circularity; claims rest on measured speedups vs. strawmen, not derived equations or self-referential fits

The abstract contains no equations, parameters, or derivation steps. The reported TTFT speedups (10.05X and 2.44X) are explicitly presented as results from a prototype implementation compared against two strawman designs. No self-citations, fitted inputs renamed as predictions, or ansatzes appear. The design description (decoupling access and management permissions via Recallable Resource Isolation) is a proposed mechanism whose correctness is left to implementation details outside the abstract; it does not reduce to a tautology or self-citation chain within the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities beyond the two new constructs (Flex-Mem, Flex-NPU) are described. The security of the permission decoupling is an unstated domain assumption.

axioms (1)

domain assumption: ARM TrustZone provides a secure world that can be isolated from a compromised normal-world kernel
Implicit in the use of TrustZone as the isolation technology; stated in the abstract's problem setup.

invented entities (2)

Recallable Secure Memory (Flex-Mem) · no independent evidence
purpose: Secure memory that the normal-world OS can allocate and reclaim but cannot access
New construct introduced to solve inflexible resource isolation

Recallable Secure NPU (Flex-NPU) · no independent evidence
purpose: Secure NPU that the normal-world OS can allocate and reclaim but cannot access
New construct introduced to solve inflexible resource isolation

pith-pipeline@v0.9.1-grok · 5818 in / 1364 out tokens · 16076 ms · 2026-07-03T23:09:06.489151+00:00 · methodology