
arxiv: 2606.23370 · v2 · pith:5PP4RSV5new · submitted 2026-06-22 · 💻 cs.CR · cs.LG · cs.OS

FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Pith reviewed 2026-07-03 23:09 UTC · model grok-4.3

classification 💻 cs.CR · cs.LG · cs.OS
keywords LLM inference · mobile devices · TrustZone · resource isolation · secure serving · on-device AI · ARM security · memory management

The pith

FlexServe decouples access from management permissions in TrustZone so the normal OS can allocate secure memory and NPU for LLMs without reading their contents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that TrustZone protections for on-device LLMs suffer from inflexible isolation and slow resource management, creating high overhead for both secure inference and normal apps. FlexServe introduces Recallable Resource Isolation to create Flex-Mem and Flex-NPU that only the secure world can access yet the normal-world OS can still allocate and reclaim efficiently. This decoupling enables cooperative management between the two worlds through a new framework for LLM inference. If the approach holds, private model weights and user data stay protected from a compromised kernel while time-to-first-token latency drops sharply. The reported results show average speedups of 10.05X versus a basic strawman and 2.44X versus an optimized one.

Core claim

FlexServe presents Recallable Resource Isolation to construct Recallable Secure Memory (Flex-Mem) and Recallable Secure NPU (Flex-NPU). These resources remain accessible only to the secure world while the normal-world OS retains normal allocation and reclamation rights. The FlexServe Framework then coordinates with the normal OS for cooperative secure memory management during LLM inference, protecting both model weights and user data from kernel compromise.

What carries the argument

Recallable Resource Isolation, which separates access permission from management permission so the normal OS can handle allocation and reclamation without gaining read or write access.
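The split between access and management can be pictured as two independent permission bits. The sketch below is only an illustration of that idea: the names `Perm`, `RECALLABLE_POLICY`, and `is_allowed` are hypothetical and do not come from the paper, and the secure world's own role in cooperative management is left out for simplicity.

```python
from enum import Flag, auto


class Perm(Flag):
    """Hypothetical permission bits; the names are illustrative only."""
    NONE = 0
    ACCESS = auto()   # read or write the resource's contents
    MANAGE = auto()   # allocate or reclaim the resource


# Assumed policy for a recallable secure resource such as Flex-Mem:
# the normal world may manage it, only the secure world may access it.
RECALLABLE_POLICY = {
    "normal_world": Perm.MANAGE,
    "secure_world": Perm.ACCESS,
}


def is_allowed(world: str, wanted: Perm, policy=RECALLABLE_POLICY) -> bool:
    """Return True only if `world` holds every permission bit in `wanted`."""
    held = policy.get(world, Perm.NONE)
    return (held & wanted) == wanted


if __name__ == "__main__":
    print(is_allowed("normal_world", Perm.MANAGE))  # management is permitted
    print(is_allowed("normal_world", Perm.ACCESS))  # content access is denied
    print(is_allowed("secure_world", Perm.ACCESS))  # content access is permitted
```

Under a conventional static isolation scheme, both bits would sit with the same world; the claim summarized above is that they can be held by different worlds.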

If this is right

  • Secure LLM inference runs with substantially lower time-to-first-token latency than prior TrustZone designs.
  • The normal-world OS can manage secure resources at normal speed without impacting its own applications.
  • Model weights and user data remain protected from kernel-level attackers throughout inference.
  • Cooperative memory management between secure and normal worlds becomes feasible for other resource-heavy secure tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling pattern could apply to other on-device accelerators or memory pools beyond NPU.
  • It may allow secure-world applications to scale to larger models without requiring dedicated hardware reservations.
  • Similar permission separation might reduce overhead in other ARM TrustZone use cases such as secure storage or authentication.
  • Adoption would shift the design focus from static isolation to dynamic, revocable resource control in mobile security.

Load-bearing premise

The normal-world OS can be trusted to allocate and reclaim secure resources correctly even when compromised, without any ability to read or tamper with their contents.

What would settle it

A test in which a compromised normal-world kernel reads the contents of Flex-Mem or issues unauthorized commands to Flex-NPU after the isolation mechanism is applied.

Figures

Figures reproduced from arXiv: 2606.23370 by Jinyu Gu, Lixiang Wang, Yinpeng Wu, Yitong Chen, Yubin Xia, Zhichao Hua.

Figure 2
Figure 2: Breakdown of the TTFTs of the normal-world inference and the TrustZone-based strawman (Llama3.1 8B with a 128-token prompt). NPU. We compare FlexServe against two TrustZone-based strawman designs. The results show that FlexServe achieves an average 10.05× TTFT speedup over the strawman, and an average 2.44× TTFT speedup over an optimized strawman with pipelining and the secure NPU enabled. For agent workf…
Figure 3
Figure 3: System overview of FlexServe: The Flex-Monitor constructs the Flex-Mem and Flex-NPU, and the FlexServe Framework provides a fast and secure LLM inference framework. (VA) to the intermediate physical address (IPA) for each VM. The Stage-2 Page Table (S2PT), controlled by the hypervisor, then translates the IPA to the physical address (PA) for each VM. The System MMU (SMMU) is introduced to enforce access c…
Figure 4
Figure 4: Memory Protection of FlexServe. allocation, the secure-world Trusted OS maps the Flex-Mem pages into the FlexServe Framework’s address space. Details about the FlexServe Framework are provided in Section 5. If a Flex-Mem page is reclaimed, the Flex-Monitor remaps it in the normal-world S2PT and returns ownership to the normal-world OS. DMA Protection: Direct Memory Access (DMA) may be abused to access Flex…
Figure 5
Figure 5: Prefill time (TTFT) without cache. [Plot labels recovered from extraction: models Llama3.2 3B, Llama3.1 8B, Qwen3 0.6B, Qwen3 1.7B; series NW-Base, FlexServe, Strawman-OPT, Strawman; axis "Tokens Per Second".]
Figure 6
Figure 6: Decode throughput without cache. 7.3 Performance without Cache To answer question-2, we evaluate the prefill and decode performance of FlexServe. The model weights and KV caches are disabled to reveal the cold-start performance. Prefill Performance: We evaluate models ranging from 1.7B to 8B, all quantized to INT8 precision
Figure 7
Figure 7: Prefill time (TTFT) under varying background memory pressure, without cache. [Plot labels recovered from extraction: series FlexServe, Strawman-OPT, Strawman; panels (a) Qwen3 1.7B + Llama3.1 8B, (b) Qwen3 0.6B + Qwen3 1.7B, (c) Qwen3 0.6B + Qwen3 8B, (d) Qwen3 1.7B + Llama3.2 3B; x-axis groups UC, OA, DD, AD; y-axis TTFT (s).]
Figure 8
Figure 8: Prefill time (TTFT) of different model groups and different benchmarks, with 4GB cache. UC: UltraChat, OA: OpenAssistant, DD: Dolly Dataset, AD: Alpaca Data. Impact of Memory Pressure
Figure 9
Figure 9: Response latency of real-world agent workflows. FlexServe’s speedup. Nevertheless, FlexServe still outperforms the Strawman by 14.15× and Strawman-OPT by 2.94× on average. 7.5 Overhead to Normal-World Applications To answer question-4, we evaluate how FlexServe affects the performance of normal-world applications. Influence of Memory Allocation: We run SQLite [27] as the normal-world application, continuo…
Figure 10
Figure 10: Normalized performance of normal-world applications (bars, higher is better) and prefill time of secure inference (lines, lower is better). 2.46%. Moreover, the on-demand protection eliminates this overhead entirely when no secure inference task is active. End-to-End Application Overhead: To demonstrate the benefits of FlexServe’s cooperative memory management and LLM-aware memory reclamation, we evaluate…
Figure 11
Figure 11: Ablation of prefill optimization. Flex-NPU and Flex-Mem contribute the largest performance gains, reducing TTFT by 49.01% and 37.89% on average, respectively. Even after Flex-NPU, Flex-Mem, and pipeline optimizations are enabled, FlexServe’s cache optimization further reduces TTFT by 36.59% on average. This confirms the importance of FlexServe’s memory management mechanism. Although our evaluation pla…
Original abstract

Device-side Large Language Models (LLMs) have grown explosively, offering stronger privacy and higher availability than their cloud-side counterparts. During LLM inference, both the model weights and the user data are valuable, and attackers may compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead to both the secure inference and the normal applications, due to two challenges: the inflexible resource isolation and the inefficient secure resource management. To address these challenges, this paper presents FlexServe, a fast and secure LLM inference system for mobile devices. The key idea is to decouple the access permission from the management permission of secure resources, so that the normal-world OS cannot access them but can still manage them as usual. First, FlexServe introduces a Recallable Resource Isolation mechanism to construct Recallable Secure Memory (Flex-Mem) and a Recallable Secure NPU (Flex-NPU). They can only be accessed by the secure world, but can be efficiently allocated and reclaimed by the normal-world OS. Based on them, FlexServe further introduces a FlexServe Framework to run secure LLM inference in the secure world. It works together with the normal-world OS to perform cooperative secure memory management. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves average TTFT speedups of 10.05X over the strawman and 2.44X over an optimized strawman.
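For readers who want to see how a figure such as "10.05X average TTFT speedup" could be computed, the following is a minimal sketch. It assumes the speedup is the baseline TTFT divided by the system's TTFT and that the average is an arithmetic mean over configurations; the abstract does not state which mean is used. The function names and the sample timings are made up for illustration and are not taken from the paper.

```python
from statistics import mean


def ttft_speedup(baseline_s: float, system_s: float) -> float:
    """Speedup = baseline TTFT / system TTFT (both in seconds)."""
    if baseline_s <= 0 or system_s <= 0:
        raise ValueError("TTFT values must be positive")
    return baseline_s / system_s


def average_speedup(pairs) -> float:
    """Arithmetic mean of per-configuration speedups.

    Assumption: the averaging method is not specified in the abstract.
    """
    return mean(ttft_speedup(b, s) for b, s in pairs)


if __name__ == "__main__":
    # Made-up (baseline, system) TTFT pairs in seconds, NOT from the paper.
    made_up = [(8.0, 1.0), (6.0, 0.5), (9.0, 1.5)]
    print(average_speedup(made_up))
```

Because each reported number is an average of ratios, dividing 10.05 by 2.44 gives only a rough sense of the gap between the two strawman designs, not an exact figure.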

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript claims to introduce FlexServe, a system for fast and secure LLM inference on mobile devices protected by ARM TrustZone. It proposes Recallable Resource Isolation to construct Flex-Mem and Flex-NPU, which decouple access permission (restricted to the secure world) from management permission (allowing the normal-world OS to allocate and reclaim them). This is said to overcome inflexible isolation and inefficient management in prior TrustZone uses for LLMs. A prototype is reported to achieve average TTFT speedups of 10.05X versus a basic strawman and 2.44X versus an optimized strawman.

Significance. If the claimed separation between access and management can be enforced without introducing new attack surfaces, the approach could meaningfully reduce the overhead of hardware-isolated LLM inference on resource-constrained mobile devices, improving the practicality of on-device models under a compromised-OS threat model. The performance numbers, if reproducible, would indicate a substantial improvement over existing TrustZone-based designs for this workload.

major comments (1)
  1. [Abstract] Abstract: The central security claim—that Flex-Mem and Flex-NPU 'can only be accessed by the secure world' while remaining 'efficiently allocated and reclaimed by the normal-world OS'—is stated without any description of the underlying hardware primitive, TrustZone memory-region configuration, NPU access-control mechanism, or secure-monitor call sequence that would prevent a compromised normal-world kernel from reading or tampering with the resources during reclamation. This mechanism is load-bearing for both the security argument and the relevance of the reported TTFT speedups.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The point about the abstract's security claim is well-taken, and we address it directly below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central security claim—that Flex-Mem and Flex-NPU 'can only be accessed by the secure world' while remaining 'efficiently allocated and reclaimed by the normal-world OS'—is stated without any description of the underlying hardware primitive, TrustZone memory-region configuration, NPU access-control mechanism, or secure-monitor call sequence that would prevent a compromised normal-world kernel from reading or tampering with the resources during reclamation. This mechanism is load-bearing for both the security argument and the relevance of the reported TTFT speedups.

    Authors: We agree that the abstract, as a high-level summary, does not include the low-level enforcement details. The manuscript body provides these descriptions (TrustZone memory region attributes set to secure-world only, NPU access control via secure-world ownership, and SMC sequences for allocation/reclamation that revoke normal-world access). However, to make the central claim self-contained in the abstract, we will revise it to briefly reference the hardware primitives and SMC-based reclamation protocol. This addresses the concern without altering the reported performance results. revision: yes
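The allocation and reclamation sequence mentioned in the response can be pictured as a small page-ownership state machine. The toy model below is a sketch under stated assumptions, not the paper's interface: `ToyMonitor`, `smc_alloc`, and `smc_reclaim` are hypothetical names, and zeroing a page before it returns to the normal world is an assumption of this sketch that the text above does not confirm.

```python
from dataclasses import dataclass, field
from enum import Enum


class World(Enum):
    NORMAL = "normal"
    SECURE = "secure"


@dataclass
class Page:
    data: bytearray = field(default_factory=lambda: bytearray(16))
    owner: World = World.NORMAL   # SECURE means the page is in Flex-Mem


class ToyMonitor:
    """Hypothetical stand-in for the Flex-Monitor; not the paper's API."""

    def __init__(self, n_pages: int):
        self.pages = [Page() for _ in range(n_pages)]

    def _check(self, world: World, page: Page) -> None:
        # Secure pages are invisible to the normal world; the secure world
        # may touch any page in this toy model.
        if page.owner is World.SECURE and world is not World.SECURE:
            raise PermissionError("normal world cannot access a secure page")

    def smc_alloc(self, idx: int) -> None:
        """Normal world donates a page; its own access is revoked."""
        page = self.pages[idx]
        if page.owner is not World.NORMAL:
            raise RuntimeError("page is already secure")
        page.owner = World.SECURE

    def smc_reclaim(self, idx: int) -> None:
        """Normal world takes the page back.

        Assumption of this sketch: contents are zeroed before hand-back.
        """
        page = self.pages[idx]
        if page.owner is not World.SECURE:
            raise RuntimeError("page is not secure")
        page.data[:] = bytes(len(page.data))
        page.owner = World.NORMAL

    def write(self, world: World, idx: int, payload: bytes) -> None:
        page = self.pages[idx]
        self._check(world, page)
        page.data[: len(payload)] = payload

    def read(self, world: World, idx: int) -> bytes:
        page = self.pages[idx]
        self._check(world, page)
        return bytes(page.data)


if __name__ == "__main__":
    monitor = ToyMonitor(n_pages=1)
    monitor.smc_alloc(0)
    monitor.write(World.SECURE, 0, b"weights")
    try:
        monitor.read(World.NORMAL, 0)
    except PermissionError as exc:
        print("denied:", exc)
    monitor.smc_reclaim(0)
    print(monitor.read(World.NORMAL, 0))  # zeroed page after reclamation
```

The point of the model is that the normal world drives both transitions yet never observes secure contents; whether the real system enforces this through stage-2 page tables, region attributes, or another primitive is exactly what the referee asks the authors to spell out.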

Circularity Check

0 steps flagged

No circularity; claims rest on measured speedups vs. strawmen, not derived equations or self-referential fits

full rationale

The abstract contains no equations, parameters, or derivation steps. The reported TTFT speedups (10.05X and 2.44X) are explicitly presented as results from a prototype implementation compared against two strawman designs. No self-citations, fitted inputs renamed as predictions, or ansatzes appear. The design description (decoupling access and management permissions via Recallable Resource Isolation) is a proposed mechanism whose correctness is left to implementation details outside the abstract; it does not reduce to a tautology or self-citation chain within the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities beyond the two new constructs (Flex-Mem, Flex-NPU) are described. The security of the permission decoupling is an unstated domain assumption.

axioms (1)
  • domain assumption: ARM TrustZone provides a secure world that can be isolated from a compromised normal-world kernel
    Implicit in the use of TrustZone as the isolation technology; stated in the abstract's problem setup.
invented entities (2)
  • Recallable Secure Memory (Flex-Mem) · no independent evidence
    purpose: Secure memory that normal-world OS can allocate/reclaim but cannot access
    New construct introduced to solve inflexible resource isolation
  • Recallable Secure NPU (Flex-NPU) · no independent evidence
    purpose: Secure NPU that normal-world OS can allocate/reclaim but cannot access
    New construct introduced to solve inflexible resource isolation

pith-pipeline@v0.9.1-grok · 5818 in / 1364 out tokens · 16076 ms · 2026-07-03T23:09:06.489151+00:00 · methodology
