
arXiv: 2603.09046 · v2 · submitted 2026-03-10 · cs.CR · cs.LG · cs.OS

FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Pith reviewed 2026-05-15 14:18 UTC · model grok-4.3

classification: cs.CR, cs.LG, cs.OS
keywords: mobile LLM serving, TrustZone, secure inference, flexible isolation, on-device AI, TTFT optimization, multi-model scheduling

The pith

FlexServe lets ARM TrustZone protect mobile LLM inference by switching memory and NPU modes on demand, cutting average time to first token by about 10× versus a basic TrustZone strawman and 2.44× versus an optimized one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlexServe to reduce the slowdown that TrustZone protection imposes on device-side LLM inference. Standard TrustZone isolation of memory and the NPU is inflexible, which creates high overhead when shielding model weights and user data from a compromised OS kernel. FlexServe adds a mechanism that lets both memory pages and the NPU switch efficiently between protected and unprotected modes. It then layers an LLM-aware memory manager, a secure inference pipeline, and a multi-model scheduler on top of this flexibility. The resulting system targets the gap between the privacy promise of on-device LLMs and the performance cost of protecting their inference with TrustZone.

Core claim

FlexServe constructs Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU) through a Flexible Resource Isolation mechanism that supports efficient mode switches. Inside TrustZone's secure world it adds LLM-Aware Memory Management and a Secure Inference Pipeline for single-model acceleration, plus a Multi-Model Scheduler for agent-style workflows. Prototype measurements report an average TTFT speedup of 10.05× over a basic TrustZone strawman and 2.44× over an optimized strawman with pipeline and secure NPU enabled.

What carries the argument

A Flexible Resource Isolation mechanism that switches memory pages and the NPU between unprotected and protected modes.
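To make the idea of mode switching concrete, here is a minimal Python sketch of a resource that toggles between the two modes and counts how often it does so. It is a toy model only: the names `Mode`, `FlexResource`, and `run_secure_inference` are hypothetical and do not come from the paper, and nothing here reflects how FlexServe actually reprograms TrustZone hardware.

```python
from enum import Enum


class Mode(Enum):
    UNPROTECTED = "unprotected"  # usable by the normal world
    PROTECTED = "protected"      # reserved for the secure world


class FlexResource:
    """Toy model of a resource (a memory page or the NPU) with two modes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.mode = Mode.UNPROTECTED
        self.switch_count = 0  # number of real mode changes so far

    def switch_to(self, target: Mode) -> bool:
        """Switch modes; return True only if the mode actually changed."""
        if self.mode is target:
            return False
        self.mode = target
        self.switch_count += 1
        return True


def run_secure_inference(resources: list[FlexResource]) -> int:
    """Protect every resource, stand in for inference, then release them.

    Returns the number of mode switches this request caused.
    """
    before = sum(r.switch_count for r in resources)
    for r in resources:
        r.switch_to(Mode.PROTECTED)
    # Secure-world inference would run here (omitted in this sketch).
    for r in resources:
        r.switch_to(Mode.UNPROTECTED)
    return sum(r.switch_count for r in resources) - before


# Hypothetical setup: four pages plus one NPU.
pages = [FlexResource(f"page-{i}") for i in range(4)]
npu = FlexResource("npu")
switches = run_secure_inference(pages + [npu])
print(switches)
```

Counting switches per request in this way is the kind of bookkeeping that would let switch cost be compared against total inference time.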

Load-bearing premise

The overhead and security properties of rapid mode switches between protected and unprotected states remain stable when measured on production mobile hardware and under realistic kernel attacks.

What would settle it

If benchmarks on additional devices with live kernel exploits show that mode-switch latency or data exposure exceeds the reported gains, the central speedup and security claims would fail.

Figures

Figures reproduced from arXiv: 2603.09046 by Jinyu Gu, Lixiang Wang, Yinpeng Wu, Yitong Chen, Yubin Xia, Zhichao Hua.

Figure 1: Latency of allocating memory with different sizes.
Figure 2: Breakdown of the TTFTs of normal-world inference.
Figure 3: System overview of FlexServe: The Flex-Monitor constructs the Flex-Mem and Flex-NPU, and the FlexServe Framework provides a fast and secure LLM inference framework.
  Accompanying text: … model weights and input/output are protected. All normal-world applications are considered untrusted. FlexServe assumes the initial kernel code is benign and that secure boot protects its integrity. However, the kernel may contain bugs and cou…
Figure 4: Memory Protection of FlexServe.
Figure 5: TTFT with different input lengths and models.
Figure 6: Decode throughput with different models.
Figure 7: TTFT under varying background memory pressure.
Figure 8: TTFT of different model groups on real-world benchmarks with a 4GB model cache. UC: UltraChat, OA: …
Figure 9: Response latency of real-world agent workflows.
Figure 11: Performance overhead to the SQLite.
Original abstract

Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone's secure world. The LLM-Aware Memory Management and Secure Inference Pipeline are introduced to accelerate inference. A Multi-Model Scheduler is proposed to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves an average $10.05\times$ speedup in Time to First Token (TTFT) compared to the strawman, and an average $2.44\times$ TTFT speedup compared to an optimized strawman with pipeline and secure NPU enabled. For multi-model agent workflows, the end-to-end speedup is up to $24.30\times$ and $4.05\times$ compared to the strawman and optimized strawman, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents FlexServe, a secure LLM serving system for mobile devices that uses ARM TrustZone with a new Flexible Resource Isolation mechanism. This enables efficient dynamic switching of memory pages (Flex-Mem) and the NPU (Flex-NPU) between protected and unprotected modes. Building on these, the system adds LLM-Aware Memory Management, a Secure Inference Pipeline, and a Multi-Model Scheduler. A prototype implementation is evaluated against two TrustZone-based strawman designs, reporting average TTFT speedups of 10.05× versus the basic strawman and 2.44× versus an optimized strawman (with pipeline and secure NPU), plus end-to-end gains up to 24.30× and 4.05× for multi-model agent workflows.
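As a rough consistency check on the figures quoted in the summary, one can ask what they imply about the two strawman baselines relative to each other. The sketch below assumes, which the paper does not state, that both average TTFT speedups are taken over the same workloads and that a ratio of averages is a fair stand-in for the per-workload ratio. The symbols $T_{\text{basic}}$, $T_{\text{opt}}$, and $T_{\text{Flex}}$ are introduced here for illustration only.

```latex
\[
\frac{T_{\text{basic}}}{T_{\text{opt}}}
  = \frac{T_{\text{basic}} / T_{\text{Flex}}}{T_{\text{opt}} / T_{\text{Flex}}}
  \approx \frac{10.05}{2.44}
  \approx 4.12
\]
```

Under those assumptions the optimized strawman would itself be roughly 4× faster than the basic one on TTFT. The analogous ratio for the "up to" multi-model figures (24.30 / 4.05 = 6) is less reliable still, since the two maxima may come from different workflows.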

Significance. If the performance claims are supported by complete characterization of mode-switching costs, this work would be significant for practical on-device LLM deployment. It directly addresses the tension between strong hardware isolation (TrustZone) and inference efficiency on resource-constrained mobile devices, offering a concrete prototype that demonstrates flexible isolation can deliver substantial speedups while maintaining security guarantees.

major comments (2)
  1. [Evaluation] Evaluation section: The headline TTFT claims (10.05× vs strawman, 2.44× vs optimized strawman) and multi-model gains (up to 24.30× / 4.05×) attribute improvements to Flexible Resource Isolation, yet no microbenchmark data, switch counts per inference step, or ablation isolating Flex-Mem/Flex-NPU switching latency from LLM-Aware Memory Management or the pipeline is provided. Without these, it is impossible to confirm that mode-switching overheads (e.g., TLB invalidation or NPU reconfiguration) are negligible relative to inference time.
  2. [§4.3] §4.3 (Secure Inference Pipeline): The integration of Flex-NPU mode switching with pipeline stages is described at a high level, but the paper does not quantify reconfiguration costs or their accumulation across token generation steps. This is load-bearing for the central claim that flexible isolation accelerates inference without eroding the reported speedups.
minor comments (2)
  1. [Abstract] The abstract and introduction refer to 'strawman designs' without a concise summary of their key limitations; adding one sentence would improve accessibility for readers.
  2. [Evaluation] Performance figures lack error bars, standard deviations, or details on workload selection and measurement methodology, which are standard for empirical systems papers.
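The request for error bars in the last minor comment can be illustrated with a short Python sketch that summarizes repeated TTFT measurements as mean and sample standard deviation before forming a speedup. The sample values are made up for illustration and are not data from the paper; the function name `summarize` is likewise hypothetical.

```python
from statistics import mean, stdev


def summarize(samples_ms: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of repeated measurements."""
    return mean(samples_ms), stdev(samples_ms)


# Hypothetical TTFT samples in milliseconds; NOT measurements from the paper.
baseline_ms = [800.0, 820.0, 780.0, 810.0, 790.0]
flexible_ms = [200.0, 205.0, 195.0, 210.0, 190.0]

base_mean, base_std = summarize(baseline_ms)
flex_mean, flex_std = summarize(flexible_ms)
speedup = base_mean / flex_mean  # ratio of mean latencies

print(f"baseline: {base_mean:.1f} +/- {base_std:.1f} ms")
print(f"flexible: {flex_mean:.1f} +/- {flex_std:.1f} ms")
print(f"speedup of means: {speedup:.2f}x")
```

Reporting the spread alongside each mean lets a reader judge whether a claimed speedup is larger than run-to-run variation.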

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the evaluation. We agree that additional microbenchmark data and quantifications will strengthen the paper and will revise the manuscript accordingly to address both major points.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The headline TTFT claims (10.05× vs strawman, 2.44× vs optimized strawman) and multi-model gains (up to 24.30× / 4.05×) attribute improvements to Flexible Resource Isolation, yet no microbenchmark data, switch counts per inference step, or ablation isolating Flex-Mem/Flex-NPU switching latency from LLM-Aware Memory Management or the pipeline is provided. Without these, it is impossible to confirm that mode-switching overheads (e.g., TLB invalidation or NPU reconfiguration) are negligible relative to inference time.

    Authors: We agree that microbenchmark data would better isolate contributions and confirm negligible overheads. In the revised manuscript we will add: (1) microbenchmarks measuring Flex-Mem and Flex-NPU switching latencies including TLB invalidation and NPU reconfiguration costs; (2) the exact number of mode switches per inference step for representative workloads; and (3) an ablation study separating Flexible Resource Isolation from LLM-Aware Memory Management and the pipeline. These additions will directly show that switching costs remain negligible relative to inference time and support the reported speedups. revision: yes

  2. Referee: [§4.3] §4.3 (Secure Inference Pipeline): The integration of Flex-NPU mode switching with pipeline stages is described at a high level, but the paper does not quantify reconfiguration costs or their accumulation across token generation steps. This is load-bearing for the central claim that flexible isolation accelerates inference without eroding the reported speedups.

    Authors: We acknowledge the need for explicit quantification. In the revision we will expand §4.3 with measured Flex-NPU reconfiguration latencies and an analysis of their cumulative impact across successive token-generation steps. The new data will demonstrate that these costs do not erode the overall speedups delivered by flexible isolation, thereby reinforcing the central performance claim. revision: yes
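The accumulation question raised in both major comments can be framed as simple accounting: total switch cost is the number of mode switches times the per-switch latency, compared against end-to-end latency. The Python sketch below does that arithmetic. The function name and every number in the example call are assumptions chosen for illustration; the paper, as summarized here, does not report switch counts or switch latencies.

```python
def switch_overhead_fraction(
    switches_prefill: int,
    switches_per_decode_step: int,
    decode_steps: int,
    switch_latency_ms: float,
    total_latency_ms: float,
) -> float:
    """Fraction of end-to-end latency spent on mode switches.

    Assumes every switch costs the same and that switch costs are not
    overlapped with computation, which is a simplification.
    """
    if total_latency_ms <= 0:
        raise ValueError("total_latency_ms must be positive")
    total_switches = switches_prefill + switches_per_decode_step * decode_steps
    return total_switches * switch_latency_ms / total_latency_ms


# Hypothetical workload: all values below are placeholders, not paper data.
fraction = switch_overhead_fraction(
    switches_prefill=4,
    switches_per_decode_step=2,
    decode_steps=100,
    switch_latency_ms=0.05,
    total_latency_ms=5000.0,
)
print(f"{fraction:.4%} of latency spent switching")
```

Because the per-step term is multiplied by the number of decode steps, even a small per-switch cost can matter for long generations, which is why the referee asks for measured values rather than a qualitative claim.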

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical prototype benchmarks

Full rationale

The paper describes a systems implementation (Flexible Resource Isolation, LLM-Aware Memory Management, Secure Inference Pipeline, Multi-Model Scheduler) and reports measured speedups from a prototype against strawman baselines. No equations, first-principles derivations, or predictions appear that reduce by construction to fitted inputs or self-referential definitions. Performance numbers are direct experimental results, not outputs of any model that was calibrated on the same quantities. Self-citations, if present, are not load-bearing for the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the introduction of two new mechanisms (Flex-Mem and Flex-NPU) without independent evidence beyond the prototype. It relies on the standard assumption that TrustZone provides effective isolation.

axioms (1)
  • domain assumption: ARM TrustZone provides hardware-based isolation between secure and normal worlds that protects against a compromised OS kernel.
    Invoked as the foundation for all secure inference claims.
invented entities (2)
  • Flex-Mem (no independent evidence)
    purpose: Flexible secure memory that can be efficiently switched between protected and unprotected modes.
    New mechanism introduced to reduce isolation overhead for LLM weights and data.
  • Flex-NPU (no independent evidence)
    purpose: Flexible secure NPU that can be efficiently switched between protected and unprotected modes.
    New mechanism introduced to reduce overhead for AI acceleration during secure inference.

pith-pipeline@v0.9.0 · 5627 in / 1409 out tokens · 68263 ms · 2026-05-15T14:18:35.483668+00:00 · methodology

