FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Jinyu Gu; Lixiang Wang; Yinpeng Wu; Yitong Chen; Yubin Xia; Zhichao Hua

arXiv: 2603.09046 · v2 · submitted 2026-03-10 · 💻 cs.CR · cs.LG · cs.OS


Pith reviewed 2026-05-15 14:18 UTC · model grok-4.3

classification 💻 cs.CR · cs.LG · cs.OS

keywords mobile LLM serving · TrustZone · secure inference · flexible isolation · on-device AI · TTFT optimization · multi-model scheduling


The pith

FlexServe allows ARM TrustZone to protect mobile LLM inference by switching memory and NPU modes on demand, cutting average time to first token by about 10x versus a basic TrustZone strawman and by 2.44x versus an optimized one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlexServe to reduce the slowdown that TrustZone protection imposes on device-side LLM inference. Standard TrustZone isolation of memory and the NPU creates high overhead when shielding model weights and user data from a compromised OS kernel. FlexServe adds a mechanism that lets both memory pages and the NPU flip rapidly between protected and unprotected states. It then layers an LLM-aware memory manager, a secure inference pipeline, and a multi-model scheduler on top of this flexibility. The resulting system targets the gap between the privacy promise of on-device LLMs and the performance cost that currently makes them impractical.

Core claim

FlexServe constructs Flexible Secure Memory and Flexible Secure NPU through a Flexible Resource Isolation mechanism that supports fast mode switches. Inside TrustZone's secure world it adds LLM-Aware Memory Management and a Secure Inference Pipeline for single-model acceleration, plus a Multi-Model Scheduler for agent-style workflows. Prototype measurements show these changes produce large reductions in inference latency compared with both basic and pipeline-enabled TrustZone strawman designs.

What carries the argument

Flexible Resource Isolation mechanism that switches memory pages and the NPU between unprotected and protected modes

Load-bearing premise

The overhead and security properties of rapid mode switches between protected and unprotected states remain stable when measured on production mobile hardware and under realistic kernel attacks.

What would settle it

If benchmarks on additional devices with live kernel exploits show that mode-switch latency or data exposure exceeds the reported gains, the central speedup and security claims would fail.

Figures

Figures reproduced from arXiv: 2603.09046 by Jinyu Gu, Lixiang Wang, Yinpeng Wu, Yitong Chen, Yubin Xia, Zhichao Hua.

**Figure 2.** Breakdown of the TTFTs of normal-world inference.

**Figure 3.** System overview of FlexServe: The Flex-Monitor constructs the Flex-Mem and Flex-NPU, and the FlexServe Framework provides a fast and secure LLM inference framework.

**Figure 4.** Memory Protection of FlexServe.

**Figure 5.** TTFT with different input lengths and models.

**Figure 6.** Decode throughput with different models.

**Figure 7.** TTFT under varying background memory pressure.

**Figure 8.** TTFT of different model groups on real-world benchmarks with a 4GB model cache. UC: UltraChat, OA: …

**Figure 9.** Response latency of real-world agent workflows.

**Figure 11.** Performance overhead to the SQLite.

Original abstract

Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone's secure world. The LLM-Aware Memory Management and Secure Inference Pipeline are introduced to accelerate inference. A Multi-Model Scheduler is proposed to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves an average $10.05\times$ speedup in Time to First Token (TTFT) compared to the strawman, and an average $2.44\times$ TTFT speedup compared to an optimized strawman with pipeline and secure NPU enabled. For multi-model agent workflows, the end-to-end speedup is up to $24.30\times$ and $4.05\times$ compared to the strawman and optimized strawman, respectively.
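The two TTFT speedups in the abstract share FlexServe as the numerator, so dividing one by the other gives a rough idea of how far apart the two strawman baselines are. The sketch below does that arithmetic. It assumes both figures were measured on the same workloads, which the abstract does not state; averages of per-workload ratios do not compose exactly, and the "up to" workflow figures may come from different workflows. The function name `implied_baseline_gap` is hypothetical and does not come from the paper.

```python
# Speedups reported in the abstract (average TTFT).
SPEEDUP_VS_BASIC = 10.05      # FlexServe vs. basic strawman
SPEEDUP_VS_OPTIMIZED = 2.44   # FlexServe vs. optimized strawman


def implied_baseline_gap(vs_basic: float, vs_optimized: float) -> float:
    """Estimate how much faster the optimized strawman is than the basic one.

    Assumption: both speedups refer to the same workload, so the FlexServe
    latency cancels out of the ratio.
    """
    if vs_basic <= 0 or vs_optimized <= 0:
        raise ValueError("speedups must be positive")
    return vs_basic / vs_optimized


ttft_gap = implied_baseline_gap(SPEEDUP_VS_BASIC, SPEEDUP_VS_OPTIMIZED)
# Peak ("up to") end-to-end figures for multi-model agent workflows.
workflow_gap = implied_baseline_gap(24.30, 4.05)

print(f"TTFT: optimized strawman is roughly {ttft_gap:.2f}x faster than basic")
print(f"Agent workflows (peak figures): roughly {workflow_gap:.2f}x")
```

Read this way, a sizable share of the headline gain over the basic strawman is already available from the pipeline and secure NPU in the optimized strawman, which is why the comparison against the optimized baseline is the more informative one.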

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FlexServe adds flexible TrustZone mode switching for mobile LLM serving and shows solid prototype speedups, but the switching overheads stay unmeasured.


The paper's main contribution is a pair of mechanisms, Flex-Mem and Flex-NPU, that let memory pages and the NPU flip between protected and unprotected modes without the usual heavy TrustZone tax. The authors pair this with LLM-aware memory management, a secure inference pipeline, and a multi-model scheduler. The prototype runs on real hardware and reports concrete numbers: roughly a 10x TTFT speedup over a basic strawman and 2.4x over an optimized one, with bigger gains in multi-model agent flows. That is useful evidence for anyone who has tried to run protected inference on phones and hit the isolation wall. The implementation looks honest; they actually built it and compared against two TrustZone baselines rather than just claiming theoretical wins. The multi-model scheduler is a practical addition that addresses a real workload pattern.

The soft spot is exactly what the stress test flags. The speedups rest on the assumption that mode switches are cheap, yet the paper gives no microbenchmark for switch latency, no count of switches per token, and no ablation that isolates the switching cost from the other optimizations. In longer multi-model runs even small per-switch costs could shrink the advantage. The abstract also skips error bars and workload details, so the numbers are harder to judge without the full experimental section.

This is for systems people who care about on-device security and performance. It has a working prototype and addresses a clear pain point, so it deserves a serious referee rather than a desk reject. The reviewers can push on the missing overhead data and the experimental rigor, but the core idea is worth the time.

Referee Report

2 major / 2 minor

Summary. The paper presents FlexServe, a secure LLM serving system for mobile devices that uses ARM TrustZone with a new Flexible Resource Isolation mechanism. This enables efficient dynamic switching of memory pages (Flex-Mem) and the NPU (Flex-NPU) between protected and unprotected modes. Building on these, the system adds LLM-Aware Memory Management, a Secure Inference Pipeline, and a Multi-Model Scheduler. A prototype implementation is evaluated against two TrustZone-based strawman designs, reporting average TTFT speedups of 10.05× versus the basic strawman and 2.44× versus an optimized strawman (with pipeline and secure NPU), plus end-to-end gains up to 24.30× and 4.05× for multi-model agent workflows.

Significance. If the performance claims are supported by complete characterization of mode-switching costs, this work would be significant for practical on-device LLM deployment. It directly addresses the tension between strong hardware isolation (TrustZone) and inference efficiency on resource-constrained mobile devices, offering a concrete prototype that demonstrates flexible isolation can deliver substantial speedups while maintaining security guarantees.

major comments (2)

[Evaluation] Evaluation section: The headline TTFT claims (10.05× vs strawman, 2.44× vs optimized strawman) and multi-model gains (up to 24.30× / 4.05×) attribute improvements to Flexible Resource Isolation, yet no microbenchmark data, switch counts per inference step, or ablation isolating Flex-Mem/Flex-NPU switching latency from LLM-Aware Memory Management or the pipeline is provided. Without these, it is impossible to confirm that mode-switching overheads (e.g., TLB invalidation or NPU reconfiguration) are negligible relative to inference time.
[§4.3] §4.3 (Secure Inference Pipeline): The integration of Flex-NPU mode switching with pipeline stages is described at a high level, but the paper does not quantify reconfiguration costs or their accumulation across token generation steps. This is load-bearing for the central claim that flexible isolation accelerates inference without eroding the reported speedups.
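Both major comments turn on how per-switch cost accumulates. A minimal sketch of that concern follows, using a simple additive model in which every mode switch adds a fixed latency to FlexServe's run. The model, the function names, and all numbers in the usage lines are illustrative assumptions; none of them are measurements from the paper, and real switches may overlap with computation rather than add serially.

```python
def effective_speedup(baseline_ms: float, flex_compute_ms: float,
                      switches: int, switch_cost_ms: float) -> float:
    """Speedup over a baseline once mode-switch time is charged to FlexServe.

    Assumed model: total FlexServe time = compute time + switches * per-switch cost.
    """
    if baseline_ms <= 0 or flex_compute_ms <= 0:
        raise ValueError("latencies must be positive")
    if switches < 0 or switch_cost_ms < 0:
        raise ValueError("switch count and cost must be non-negative")
    return baseline_ms / (flex_compute_ms + switches * switch_cost_ms)


def break_even_switch_cost(baseline_ms: float, flex_compute_ms: float,
                           switches: int) -> float:
    """Per-switch cost (ms) at which the speedup drops to exactly 1x."""
    if switches <= 0:
        raise ValueError("need at least one switch")
    return (baseline_ms - flex_compute_ms) / switches


# Hypothetical numbers chosen so that zero-cost switching gives 10.05x.
free = effective_speedup(1005.0, 100.0, switches=200, switch_cost_ms=0.0)
costly = effective_speedup(1005.0, 100.0, switches=200, switch_cost_ms=0.5)
limit = break_even_switch_cost(1005.0, 100.0, switches=200)
print(free, costly, limit)
```

Under these made-up inputs, a half-millisecond switch repeated 200 times halves the speedup. That is the kind of sensitivity a switch-latency microbenchmark and a per-step switch count would let readers check against the real system.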

minor comments (2)

[Abstract] The abstract and introduction refer to 'strawman designs' without a concise summary of their key limitations; adding one sentence would improve accessibility for readers.
[Evaluation] Performance figures lack error bars, standard deviations, or details on workload selection and measurement methodology, which are standard for empirical systems papers.
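The second minor comment asks for error bars. As a sketch of what such reporting could look like, the snippet below computes the mean, sample standard deviation, and standard error of repeated TTFT measurements with Python's standard `statistics` module. The sample values and the `summarize_ttft` helper are hypothetical and are not data or tooling from the paper.

```python
import statistics


def summarize_ttft(samples_ms: list[float]) -> tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error) in ms."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(samples_ms)
    stdev = statistics.stdev(samples_ms)        # sample (n - 1) estimator
    sem = stdev / len(samples_ms) ** 0.5        # standard error of the mean
    return mean, stdev, sem


# Hypothetical TTFT measurements from five repeated runs of one workload.
runs_ms = [100.0, 102.0, 98.0, 101.0, 99.0]
mean_ms, stdev_ms, sem_ms = summarize_ttft(runs_ms)
print(f"TTFT = {mean_ms:.1f} ms, stdev {stdev_ms:.2f} ms, SEM {sem_ms:.2f} ms")
```

Reporting the spread per workload, along with the number of runs, would let readers judge whether a gap such as 2.44x is stable or within run-to-run noise.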

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the evaluation. We agree that additional microbenchmark data and quantifications will strengthen the paper and will revise the manuscript accordingly to address both major points.


Referee: [Evaluation] Evaluation section: The headline TTFT claims (10.05× vs strawman, 2.44× vs optimized strawman) and multi-model gains (up to 24.30× / 4.05×) attribute improvements to Flexible Resource Isolation, yet no microbenchmark data, switch counts per inference step, or ablation isolating Flex-Mem/Flex-NPU switching latency from LLM-Aware Memory Management or the pipeline is provided. Without these, it is impossible to confirm that mode-switching overheads (e.g., TLB invalidation or NPU reconfiguration) are negligible relative to inference time.

Authors: We agree that microbenchmark data would better isolate contributions and confirm negligible overheads. In the revised manuscript we will add: (1) microbenchmarks measuring Flex-Mem and Flex-NPU switching latencies including TLB invalidation and NPU reconfiguration costs; (2) the exact number of mode switches per inference step for representative workloads; and (3) an ablation study separating Flexible Resource Isolation from LLM-Aware Memory Management and the pipeline. These additions will directly show that switching costs remain negligible relative to inference time and support the reported speedups.
revision: yes
Referee: [§4.3] §4.3 (Secure Inference Pipeline): The integration of Flex-NPU mode switching with pipeline stages is described at a high level, but the paper does not quantify reconfiguration costs or their accumulation across token generation steps. This is load-bearing for the central claim that flexible isolation accelerates inference without eroding the reported speedups.

Authors: We acknowledge the need for explicit quantification. In the revision we will expand §4.3 with measured Flex-NPU reconfiguration latencies and an analysis of their cumulative impact across successive token-generation steps. The new data will demonstrate that these costs do not erode the overall speedups delivered by flexible isolation, thereby reinforcing the central performance claim.
revision: yes
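The ablation promised in the first response can be pictured as a grid over the three components the referee names. The sketch below enumerates every on/off combination and pairs up the configurations that differ only in Flexible Resource Isolation, which are the pairs that would isolate its contribution. The component identifiers and helper names are hypothetical. The abstract says the framework is built on Flex-Mem and Flex-NPU, so some cells (for example, the pipeline without flexible isolation) may not be buildable in the real system.

```python
from itertools import product

# Hypothetical labels for the three components named in the referee report.
COMPONENTS = (
    "flexible_resource_isolation",
    "llm_aware_memory_management",
    "secure_inference_pipeline",
)


def ablation_grid(components: tuple[str, ...] = COMPONENTS) -> list[dict[str, bool]]:
    """Enumerate every on/off combination of the given components."""
    return [dict(zip(components, flags))
            for flags in product((False, True), repeat=len(components))]


def pairs_isolating(grid: list[dict[str, bool]], target: str):
    """Pair configurations that differ only in whether `target` is enabled."""
    off = [cfg for cfg in grid if not cfg[target]]
    return [(cfg, {**cfg, target: True}) for cfg in off]


grid = ablation_grid()
pairs = pairs_isolating(grid, "flexible_resource_isolation")
for without, with_target in pairs:
    print(without, "->", with_target)
```

Comparing measured TTFT within each pair would show how much of the speedup is attributable to flexible isolation alone, instead of to the memory manager or the pipeline.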

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical prototype benchmarks


The paper describes a systems implementation (Flexible Resource Isolation, LLM-Aware Memory Management, Secure Inference Pipeline, Multi-Model Scheduler) and reports measured speedups from a prototype against strawman baselines. No equations, first-principles derivations, or predictions appear that reduce by construction to fitted inputs or self-referential definitions. Performance numbers are direct experimental results, not outputs of any model that was calibrated on the same quantities. Self-citations, if present, are not load-bearing for the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the introduction of two new mechanisms (Flex-Mem and Flex-NPU) without independent evidence beyond the prototype. It relies on the standard assumption that TrustZone provides effective isolation.

axioms (1)

domain assumption: ARM TrustZone provides hardware-based isolation between secure and normal worlds that protects against a compromised OS kernel.
Invoked as the foundation for all secure inference claims.

invented entities (2)

Flex-Mem (no independent evidence)
purpose: Flexible secure memory that can be efficiently switched between protected and unprotected modes.
New mechanism introduced to reduce isolation overhead for LLM weights and data.
Flex-NPU (no independent evidence)
purpose: Flexible secure NPU that can be efficiently switched between protected and unprotected modes.
New mechanism introduced to reduce overhead for AI acceleration during secure inference.



work page arXiv 2023
[68]

On-device language models: A comprehensive review.arXiv preprint arXiv:2409.00088,

Jiajun Xu, Zhiyuan Li, Wei Chen, Qun Wang, Xin Gao, Qi Cai, and Ziyuan Ling. On-device language models: A comprehensive review.arXiv preprint arXiv:2409.00088, 2024

work page arXiv 2024
[69]

Ui-ug: A unified mllm for ui understanding and generation.arXiv preprint arXiv:2509.24361, 2025

Hao Yang, Weijie Qiu, Ru Zhang, Zhou Fang, Ruichao Mao, Xiaoyu Lin, Maji Huang, Zhaosong Huang, Teng Guo, Shuoyang Liu, et al. Ui-ug: A unified mllm for ui understanding and generation.arXiv preprint arXiv:2509.24361, 2025

work page arXiv 2025
[70]

Penetralium: Privacy- preserving and memory-efficient neural network infer- ence at the edge.Future Generation Computer Systems, 156:30–41, 2024

Mengda Yang, Wenzhe Yi, Juan Wang, Hongxin Hu, Xiaoyang Xu, and Ziang Li. Penetralium: Privacy- preserving and memory-efficient neural network infer- ence at the edge.Future Generation Computer Systems, 156:30–41, 2024

work page 2024
[71]

rknn-llm

yhcvb. rknn-llm. https://github.com/airockchip/ rknn-llm, 2025

work page 2025
[72]

rknpu-driver

yhcvb. rknpu-driver. https://github.com/ airockchip/rknn-llm/tree/main/rknpu-driver, 2025

work page 2025
[73]

Babyagi.https://babyagi.org/, 2026

Yohei. Babyagi.https://babyagi.org/, 2026

work page 2026
[74]

Ferret-ui: Grounded mobile ui understand- ing with multimodal llms

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui: Grounded mobile ui understand- ing with multimodal llms. InEuropean Conference on Computer Vision, pages 240–255. Springer, 2024

work page 2024
[75]

arXiv:2509.00531 [cs.MA] https://arxiv.org/abs/2509.00531

Cheng Zhang, Erhu Feng, Xi Zhao, Yisheng Zhao, Wangbo Gong, Jiahui Sun, Dong Du, Zhichao Hua, Yubin Xia, and Haibo Chen. Mobiagent: A system- atic framework for customizable mobile agents.arXiv preprint arXiv:2509.00531, 2025

work page arXiv 2025
[76]

You only look at screens: Multimodal chain-of-action agents

Zhuosheng Zhang and Aston Zhang. You only look at screens: Multimodal chain-of-action agents. InFindings of the Association for Computational Linguistics: ACL 2024, pages 3132–3149, 2024

work page 2024
[77]

Enabling rack-scale confidential computing using heterogeneous trusted execution environment

Jianping Zhu, Rui Hou, XiaoFeng Wang, Wenhao Wang, Jiangfeng Cao, Boyan Zhao, Zhongpu Wang, Yuhui Zhang, Jiameng Ying, Lixin Zhang, et al. Enabling rack-scale confidential computing using heterogeneous trusted execution environment. In2020 IEEE Sympo- sium on Security and Privacy (SP), pages 1450–1465. IEEE, 2020. 17

work page 2020

[1] Apple intelligence. https://www.apple.com/apple-intelligence/, Sep, 2025

[2] Galaxy ai. https://www.samsung.com/us/galaxy-ai/, Sep, 2025

[3] Linux cves. https://www.cvedetails.com/version-list/33/47/1/Linux-Linux-Kernel.html, Sep, 2025

[4] The linux kernel surpasses 40 million lines of code: A historic milestone in open-source software. https://www.stackscale.com/blog/linux-kernel-surpasses-40-million-lines-code/, Sep, 2025

[5] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

[6] Aboorva Devarajan Abdul Haleem and so on. stress-ng. https://github.com/ColinIanKing/stress-ng, 2020

[7] Tiago Alves and Don Felton. Trustzone: Integrated hardware and software security. ARM white paper, 3(4):18–24, 2004

[8] Android. Android virtualization framework (avf) overview. https://source.android.com/docs/core/virtualization, 2026

[9] Android. Memory allocation among processes. https://developer.android.com/topic/performance/memory-management, 2026

[10] Android. Overview of memory management. https://developer.android.com/topic/performance/memory-overview, 2026

[11] AutoGPT. What is the autogpt platform? https://agpt.co/docs/platform, 2026

[12] Ahmed M Azab, Kirk Swidowski, Jia Ma Bhutkar, Wenbo Shen, Ruowen Wang, and Peng Ning. Skee: A lightweight secure kernel-level execution environment for arm. In Network & Distributed System Security Symposium (NDSS), 2016

[13] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

[14] Ferdinand Brasser, David Gens, Patrick Jauernig, Ahmad-Reza Sadeghi, and Emmanuel Stapf. Sanctuary: Arming trustzone with user-space enclaves. 2019

[15] Le Chen, Dahu Feng, Erhu Feng, Yingrui Wang, Rong Zhao, Yubin Xia, Pinjie Xu, and Haibo Chen. Characterizing mobile soc for accelerating heterogeneous llm inference. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 359–374, 2025

[16] Yeongpil Cho, Junbum Shin, Donghyun Kwon, MyungJoo Ham, Yuna Kim, and Yunheung Paek. Hardware-assisted on-demand hypervisor activation for efficient security critical code execution on mobile devices. In 2016 USENIX Annual Technical Conference (USENIX ATC 16), pages 565–578. USENIX Association, 2016

[17] Victor Costan and Srinivas Devadas. Intel sgx explained. Cryptology ePrint Archive, 2016

[18] Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, Tamay Besiroglu, and David Owen. The rising costs of training frontier ai models. arXiv preprint arXiv:2405.21015, 2024

[19] Yunjie Deng, Chenxu Wang, Shunchang Yu, Shiqing Liu, Zhenyu Ning, Kevin Leach, Jin Li, Shoumeng Yan, Zhengyu He, Jiannong Cao, et al. Strongbox: A gpu tee on arm endpoints. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 769–783, 2022

[20] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024

[21] Tarek Elgamal and Klara Nahrstedt. Serdab: An iot framework for partitioning neural networks computation across multiple enclaves. In 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pages 519–528. IEEE, 2020

[22] Shulin Fan, Zhichao Hua, Yubin Xia, and Haibo Chen. Xputee: a high-performance and practical heterogeneous trusted execution environment for gpus. ACM Transactions on Computer Systems, 43(1-2):1–27, 2025

[23] AI4Finance Foundation. Fingpt. https://huggingface.co/FinGPT, 2026

[24] Google. On-device small language models with multimodality, rag, and function calling, 2026

[25] Karan Grover, Shruti Tople, Shweta Shinde, Ranjita Bhagwan, and Ramachandran Ramjee. Privado: Practical and secure dnn inference with enclaves. arXiv preprint arXiv:1810.00602, 2018

[26] Le Guan, Peng Liu, Xinyu Xing, Xinyang Ge, Shengzhi Zhang, Meng Yu, and Trent Jaeger. Trustshadow: Secure execution of unmodified applications with arm trustzone. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pages 488–501, 2017

[27] D. Richard Hipp. Sqlite. https://www.sqlite.org/. Version 3.x, accessed 2024-05-10

[28] Zhichao Hua, Jinyu Gu, Yubin Xia, Haibo Chen, Binyu Zang, and Haibing Guan. vTZ: virtualizing ARM TrustZone. In 26th USENIX Security Symposium (USENIX Security 17), pages 541–556, 2017

[29] Tyler Hunt, Zhipeng Jia, Vance Miller, Ariel Szekely, Yige Hu, Christopher J. Rossbach, and Emmett Witchel. Telekine: Secure computing with cloud GPUs. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 817–833, Santa Clara, CA, 2020. USENIX Association

[30] Md Shihabul Islam, Mahmoud Zamani, Chung Hwan Kim, Latifur Khan, and Kevin W Hamlen. Confidential execution of deep learning inference at the untrusted edge with arm trustzone. In Proceedings of the Thirteenth ACM Conference on Data and Application Security and Privacy, pages 153–164, 2023

[31] Andrei Ivanov, Benjamin Rothenberger, Arnaud Dethise, Marco Canini, Torsten Hoefler, and Adrian Perrig. SAGE: Software-based attestation for GPU execution. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 485–499, Boston, MA, July

[32] Insu Jang, Adrian Tang, Taehoon Kim, Simha Sethumadhavan, and Jaehyuk Huh. Heterogeneous isolated execution for commodity gpus. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 455–468, 2019

[33] Zhaolong Jian, Xu Liu, Qiankun Dong, Longkai Cheng, Xueshuo Xie, and Tao Li. Smartzone: Runtime support for secure and efficient on-device inference on arm trustzone. IEEE Transactions on Computers, 2025

[34] Nikhil Kandpal and Colin Raffel. Position: The most expensive part of an llm should be its training data. arXiv preprint arXiv:2504.12427, 2025

[35] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023

[36] Taegyeong Lee, Zhiqi Lin, Saumay Pushp, Caihua Li, Yunxin Liu, Youngki Lee, Fengyuan Xu, Chenren Xu, Lintao Zhang, and Junehwa Song. Occlumency: Privacy-preserving remote deep-learning inference using sgx. In The 25th Annual International Conference on Mobile Computing and Networking, pages 1–17, 2019

[37] Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang, Xiyou Zhou, Jun Qin, Dian Ang Yap, Narendran Raghavan, Xuankai Chang, Margit Bowler, Eray Yildiz, et al. Apple intelligence foundation language models: Tech report 2025. arXiv preprint arXiv:2507.13575, 2025

[38] Qinfeng Li, Zhiqiang Shen, Zhenghan Qin, Yangfan Xie, Xuhong Zhang, Tianyu Du, Sheng Cheng, Xun Wang, and Jianwei Yin. Translinkguard: safeguarding transformer models against model stealing in edge deployment. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 3479–3488, 2024

[39] Wenhao Li, Haibo Li, Haibo Chen, and Yubin Xia. Adattester: Secure online mobile advertisement attestation using trustzone. In Proceedings of the 13th annual international conference on mobile systems, applications, and services, pages 75–88, 2015

[40] Wenhao Li, Mingyang Ma, Jinchen Han, Yubin Xia, Binyu Zang, Cheng-Kang Chu, and Tieyan Li. Building trusted path on untrusted device drivers for mobile devices. In Proceedings of 5th Asia-Pacific Workshop on Systems, pages 1–7, 2014

[41] Xiang Li, Zhenyan Lu, Dongqi Cai, Xiao Ma, and Mengwei Xu. Large language models on mobile devices: Measurements, analysis, and insights. In Proceedings of the Workshop on Edge and Mobile Foundation Models, pages 1–6, 2024

[42] Zeyang Li, Chuxiong Hu, Shengbo Eben Li, Jia Cheng, and Yunan Wang. Robust safe reinforcement learning under adversarial disturbances. In 2023 62nd IEEE Conference on Decision and Control (CDC), pages 334–

[43] Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui 2: Mastering universal user interface understanding across platforms. arXiv preprint arXiv:2410.18967, 2024

[44] Linaro and Contributors. OP-TEE: Open Portable Trusted Execution Environment. GitHub repository, 2025

[45] Shiyu Luo, Zhichao Hua, and Yubin Xia. Tz-kms: A secure key management service for joint cloud computing with arm trustzone. In 2018 IEEE Symposium on Service-Oriented System Engineering (SOSE), pages 180–185. IEEE, 2018

[46] Haohui Mai, Jiacheng Zhao, Hongren Zheng, Yiyang Zhao, Zibin Liu, Mingyu Gao, Cong Wang, Huimin Cui, Xiaobing Feng, and Christos Kozyrakis. Honeycomb: Secure and efficient GPU executions via static validation. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 155–172, 2023

[47] Fan Mo, Ali Shahin Shamsabadi, Kleomenis Katevas, Soteris Demetriou, Ilias Leontiadis, Andrea Cavallaro, and Hamed Haddadi. Darknetz: towards model privacy at the edge using trusted execution environments. In Proceedings of the 18th International Conference on Mobile Systems, Applications, and Services, pages 161–174, 2020

[48] mtx512. rknn-llm. https://github.com/mtx512/rk3588-npu, 2023

[49] neomancr. Basics on android ram management, what is(n’t) bloat? https://www.reddit.com/r/GalaxyS8/comments/6agads/basics_on_android_ram_management_what_isnt_bloat/, 2026

[50] Notion. The ai workspace that works for you. https://www.notion.com/product/ai, 2026

[51] Olga Ohrimenko, Felix Schuster, Cédric Fournet, Aastha Mehta, Sebastian Nowozin, Kapil Vaswani, and Manuel Costa. Oblivious Multi-Party machine learning on trusted processors. In 25th USENIX Security Symposium (USENIX Security 16), pages 619–636, 2016

[52] Heejin Park and Felix Xiaozhu Lin. Safe and practical gpu computation in trustzone. In Proceedings of the Eighteenth European Conference on Computer Systems, pages 505–520, 2023

[53] Replika. The ai companion who cares always here to listen and talk. https://replika.ai/, 2026

[54] Nuno Santos, Himanshu Raj, Stefan Saroiu, and Alec Wolman. Using arm trustzone to build a trusted language runtime for mobile applications. In Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, pages 67–80, 2014

[55] Alexander Schlögl and Rainer Böhme. ennclave: Offline inference with model confidentiality. In Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security, pages 93–104, 2020

[56] Tianxiang Shen, Ji Qi, Jianyu Jiang, Xian Wang, Siyuan Wen, Xusheng Chen, Shixiong Zhao, Sen Wang, Li Chen, Xiapu Luo, et al. SOTER: Guarding black-box inference for general neural networks at the edge. In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pages 723–738, 2022

[57] Standard Performance Evaluation Corporation (SPEC), Gainesville, VA, USA. SPEC CPU® 2017 Benchmark Suite, 2017. https://www.spec.org/cpu2017/

[58] He Sun, Kun Sun, Yuewu Wang, Jiwu Jing, and Haining Wang. Trustice: Hardware-assisted isolated computing environments on mobile devices. In Dependable Systems and Networks (DSN), 2015 45th Annual IEEE/IFIP International Conference on, pages 367–378. IEEE, 2015

[59] Zhichuang Sun, Ruimin Sun, Changming Liu, Amrita Roy Chowdhury, Long Lu, and Somesh Jha. Shadownet: A secure and efficient on-device model inference system for convolutional neural networks. In 2023 IEEE Symposium on Security and Privacy (SP), pages 1596–

[60] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

[61] Stavros Volos, Kapil Vaswani, and Rodrigo Bruno. Graviton: Trusted execution environments on gpus. In OSDI, pages 681–696, 2018

[62] Xunjie Wang, Jiacheng Shi, Zihan Zhao, Yang Yu, Zhichao Hua, and Jinyu Gu. Tz-llm: Protecting on-device large language models with arm trustzone. arXiv preprint arXiv:2511.13717, 2025

[63] Xiaolong Wu, Dave Jing Tian, and Chung Hwan Kim. Building gpu tees using cpu secure enclaves with gevisor. In Proceedings of the 2023 ACM Symposium on Cloud Computing, pages 249–264, 2023

[64] Yubin Xia, Zhichao Hua, Yang Yu, Jinyu Gu, Haibo Chen, Binyu Zang, and Haibing Guan. Colony: A privileged trusted execution environment with extensibility. IEEE Transactions on Computers, 71(2):479–492, 2021

[65] Yecheng Xiang, Yidi Wang, Hyunjong Choi, Mohsen Karimi, and Hyoseung Kim. Aegisdnn: Dependable and timely execution of dnn tasks with sgx. In 2021 IEEE Real-Time Systems Symposium (RTSS), pages 68–81. IEEE, 2021

[66] Chaojun Xiao, Xueyu Hu, Zhiyuan Liu, Cunchao Tu, and Maosong Sun. Lawformer: A pre-trained language model for chinese legal long documents. AI Open, 2:79–84, 2021

[67] Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, and Jimin Huang. Pixiu: A large language model, instruction data and evaluation benchmark for finance. arXiv preprint arXiv:2306.05443, 2023

[68] Jiajun Xu, Zhiyuan Li, Wei Chen, Qun Wang, Xin Gao, Qi Cai, and Ziyuan Ling. On-device language models: A comprehensive review. arXiv preprint arXiv:2409.00088, 2024

[69] Hao Yang, Weijie Qiu, Ru Zhang, Zhou Fang, Ruichao Mao, Xiaoyu Lin, Maji Huang, Zhaosong Huang, Teng Guo, Shuoyang Liu, et al. Ui-ug: A unified mllm for ui understanding and generation. arXiv preprint arXiv:2509.24361, 2025

[70] Mengda Yang, Wenzhe Yi, Juan Wang, Hongxin Hu, Xiaoyang Xu, and Ziang Li. Penetralium: Privacy-preserving and memory-efficient neural network inference at the edge. Future Generation Computer Systems, 156:30–41, 2024

[71] yhcvb. rknn-llm. https://github.com/airockchip/rknn-llm, 2025

[72] yhcvb. rknpu-driver. https://github.com/airockchip/rknn-llm/tree/main/rknpu-driver, 2025

[73] Yohei. Babyagi. https://babyagi.org/, 2026

[74] Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui: Grounded mobile ui understanding with multimodal llms. In European Conference on Computer Vision, pages 240–255. Springer, 2024

[75] Cheng Zhang, Erhu Feng, Xi Zhao, Yisheng Zhao, Wangbo Gong, Jiahui Sun, Dong Du, Zhichao Hua, Yubin Xia, and Haibo Chen. Mobiagent: A systematic framework for customizable mobile agents. arXiv preprint arXiv:2509.00531, 2025

[76] Zhuosheng Zhang and Aston Zhang. You only look at screens: Multimodal chain-of-action agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3132–3149, 2024

[77] Jianping Zhu, Rui Hou, XiaoFeng Wang, Wenhao Wang, Jiangfeng Cao, Boyan Zhao, Zhongpu Wang, Yuhui Zhang, Jiameng Ying, Lixin Zhang, et al. Enabling rack-scale confidential computing using heterogeneous trusted execution environment. In 2020 IEEE Symposium on Security and Privacy (SP), pages 1450–1465. IEEE, 2020