arxiv: 2605.10380 · v1 · submitted 2026-05-11 · 💻 cs.AI

Agent-X: Full Pipeline Acceleration of On-device AI Agents

Pith reviewed 2026-05-12 04:00 UTC · model grok-4.3

classification 💻 cs.AI
keywords on-device AI agents · LLM acceleration · prefix caching · speculative decoding · edge latency · agentic workloads · prefill optimization · decode acceleration

The pith

Prompt rewriting for prefix caching and LLM-free speculative decoding accelerate on-device AI agents by 1.61x end-to-end with no accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based agents achieve strong task performance yet face high end-to-end latency on edge devices because both the prefill and decode phases consume significant time. Agent-X supplies a software-only framework that rewrites prompts to exploit prefix caching matched to agent-specific input patterns and adds LLM-free speculative decoding to generate tokens faster with little added cost. On representative agentic workloads this yields a measured 1.61x end-to-end speedup on real hardware while accuracy stays identical. The method plugs directly into existing on-device agents and targets the primary latency bottlenecks across the full pipeline. If the gains hold, agentic applications become more practical on phones and other constrained devices without hardware redesign or accuracy penalties.

Core claim

Agent-X accelerates both prefill and decode stages of on-device agent workloads through prompt rewriting that enables prefix caching on agent input-token patterns together with LLM-free speculative decoding that produces tokens rapidly at minimal overhead, delivering 1.61x end-to-end speedup on representative workloads with no accuracy loss and seamless integration into existing agents.

What carries the argument

Prompt rewriting tailored to agent-specific input-token patterns for prefix caching combined with LLM-free speculative decoding for token generation.
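
A minimal sketch of why prompt ordering matters for prefix caching, assuming a cache that can reuse work only for the longest run of leading tokens shared with an earlier prompt. The token lists and the helpers `shared_prefix_len` and `build_prompt` are hypothetical names introduced for illustration; this is not the paper's rewriting procedure.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading tokens that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(static: List[str], dynamic: List[str], static_first: bool) -> List[str]:
    """Assemble a prompt from a fixed part and a per-request part."""
    return static + dynamic if static_first else dynamic + static


# Hypothetical token lists; whitespace-split words stand in for real tokens.
SYSTEM = "you are a planner with tools: mail calendar notes".split()
REQ_A = "user: email the agenda to bob".split()
REQ_B = "user: create a note about lunch".split()

# Per-request text first: the two prompts diverge almost immediately.
naive = shared_prefix_len(build_prompt(SYSTEM, REQ_A, False),
                          build_prompt(SYSTEM, REQ_B, False))

# Fixed text first: the whole static block becomes a reusable prefix.
rewritten = shared_prefix_len(build_prompt(SYSTEM, REQ_A, True),
                              build_prompt(SYSTEM, REQ_B, True))

print(naive, rewritten)
```

In this toy setup the reordered prompts share every static token, so a prefix cache could skip prefill for that block, while the naive ordering loses the match at the first request-specific token.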

Load-bearing premise

The selected representative agentic workloads capture real-world on-device usage patterns and the two optimizations generalize across models and tasks without introducing hidden accuracy or latency trade-offs.

What would settle it

Running the same end-to-end latency and accuracy measurements on a broader collection of agentic tasks or additional LLM models and observing either no speedup or any accuracy drop would disprove the central claim.

Figures

Figures reproduced from arXiv: 2605.10380 by Byeongjun Shin, Jiin Kim, Jinha Chung, Minsoo Rhu.

Figure 1: Overview of an agentic system. … due to the resource-constrained nature of edge computing. Unlike cloud-based LLMs whose primary performance bottleneck lies in the decode stage, this paper makes the key observation that on-device agents spend a significant amount of time in both the prefill and decode stages. This key insight underscores the need for full-system acceleration techniques that address all key c…
Figure 2: Structure of plan-out agents. The whole pipeline consists of two LLMs (Planner and Arbiter) and a series of tool calls
Figure 4: Illustration of applying prefix caching. Even though
Figure 3: Illustration of the ToolRAG process. TinyAgent [19] is an open-source macOS agent that fine-tunes LLMs for agentic tasks. It also introduces ToolRAG (Tool Retrieval Augmented Generation) [69] for efficient tool and example selection. Tool choice and tool-use examples. ToolRAG comprises three components: (i) offline preparation of a tool-use example database, (ii) runtime tool retrieval, and (iii) tool-use…
Figure 5: Illustration of speculative decoding. [Adjacent plot residue: latency in seconds, axis 0 to 40; series Planner prefill, Planner decode, Arbiter prefill, Arbiter decode, Non-LLM.]
Figure 6: Latency breakdown of agentic tasks. … tokens refer to portions of prompts that prefix caching can be applied to with a single static prompt, and uncacheable tokens refer to portions of prompts that cannot benefit from prefix caching due to an early token mismatch. Speculative decoding. An LLM’s decode stage is memory bandwidth-bound because of its autoregressive, one-token-at-a-time nature, yielding low comp…
Figure 8: Tool co-activation heatmap of a subset of tools in
Figure 11: Overview of PromptWeaver, divided into the of
Figure 12: Illustration of PromptWeaver’s offline KV cache precompute mechanism.
Figure 13: Overview of ExSpec with trigram (𝑛=3) LUT and draft token generation length of 4. … append single-tool examples of activated tools, which we observed to be the most popular form of tool-use examples, to the end of the clustered examples. We also add top-𝐾 (0 ≤ 𝐾 ≤ 4) relevant examples from ToolRAG (
Figure 15: Tool-use example coverage (left, red) and total
Figure 16: Prefill stage latency (left) and speedup gain (right)
Figure 18: End-to-end latency (left) and speedup (right) of
Figure 19: Decode latency (left) and speedup (right) of ExSpec
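
The Figure 13 caption mentions a trigram lookup table (LUT) and a draft length of 4. The following is a minimal sketch of that general idea, assuming a LUT that maps two consecutive tokens to a likely next token and a greedy target model that checks each drafted token. The function names, the toy corpus, and `toy_target` are hypothetical; the paper's ExSpec design may differ, and a real system would verify all drafted tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, Dict, List, Tuple

Context = Tuple[str, str]


def build_trigram_lut(corpus: List[List[str]]) -> Dict[Context, str]:
    """Map each pair of consecutive tokens to the first token seen after it."""
    lut: Dict[Context, str] = {}
    for seq in corpus:
        for i in range(len(seq) - 2):
            lut.setdefault((seq[i], seq[i + 1]), seq[i + 2])
    return lut


def draft(lut: Dict[Context, str], tokens: List[str], k: int = 4) -> List[str]:
    """Propose up to k tokens by chaining LUT lookups; no model is called."""
    out: List[str] = []
    ctx = list(tokens[-2:])
    while len(out) < k and len(ctx) == 2 and tuple(ctx) in lut:
        nxt = lut[(ctx[0], ctx[1])]
        out.append(nxt)
        ctx = [ctx[1], nxt]
    return out


def speculative_step(lut: Dict[Context, str], tokens: List[str],
                     target_next: Callable[[List[str]], str], k: int = 4) -> List[str]:
    """Keep drafted tokens while they equal the target's greedy choice,
    then append one token chosen by the target itself."""
    accepted: List[str] = []
    for tok in draft(lut, tokens, k):
        if target_next(tokens + accepted) != tok:
            break
        accepted.append(tok)
    accepted.append(target_next(tokens + accepted))
    return accepted


# Toy stand-in for the target LLM: it replays a fixed reference output.
REFERENCE = "call open_mail ( to = bob ) then call send ( )".split()


def toy_target(prefix: List[str]) -> str:
    return REFERENCE[len(prefix)] if len(prefix) < len(REFERENCE) else "<eos>"


lut = build_trigram_lut([["call", "open_mail", "(", "to", "=", "alice", ")"]])
step = speculative_step(lut, ["call", "open_mail"], toy_target)
print(step)
```

Because every accepted token is one the target would have produced anyway, the output sequence matches plain greedy decoding; the gain comes only from emitting several tokens per verification round when the drafts are right.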
Original abstract

LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X's two key components rewrite prompts to leverage prefix caching tailored to agent-specific input-token patterns and enable LLM-free speculative decoding for fast token generation with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61x end-to-end speedup in real systems with no accuracy loss and can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, ours is the first to systematically characterize and eliminate latency bottlenecks in on-device agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Agent-X, a software-only, accuracy-preserving framework for accelerating on-device LLM-based agents. Its two core techniques are prompt rewriting to exploit prefix caching on agent-specific token patterns and LLM-free speculative decoding for low-overhead token generation. The central empirical claim is a 1.61x end-to-end real-system speedup on representative agentic workloads with zero accuracy loss, plus seamless integration into existing agents; the work positions itself as the first systematic characterization and elimination of prefill and decode latency bottlenecks in on-device agents.

Significance. If the reported speedup and accuracy preservation hold under broader conditions, the contribution would be practically significant for edge deployment of agentic systems, where end-to-end latency is a primary barrier. The software-only design and focus on both prefill and decode stages are strengths that could enable immediate adoption without hardware changes. The absence of parameter fitting or circular derivations in the performance numbers is a positive feature of the empirical approach.

major comments (1)
  1. [Abstract and Evaluation] Abstract and Evaluation section: the 1.61x end-to-end speedup claim is presented as the primary result, yet the manuscript supplies no concrete description of the 'representative agentic workloads,' their selection criteria, diversity (e.g., tool-call frequency, context length distribution, multi-turn structure), exact baselines, measurement methodology (including hardware, batching, and error bars), or ablation isolating the contribution of each component. This information is load-bearing for assessing whether the speedup generalizes beyond the tested cases.
minor comments (1)
  1. [Abstract] The abstract states the techniques 'can be seamlessly integrated' but provides no concrete integration examples or overhead measurements for the prompt-rewriting and speculative-decoding modules themselves.
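
One way to frame the ablation requested in the major comment is an Amdahl-style decomposition: the end-to-end speedup follows from per-stage latencies and per-stage speedups. The sketch below assumes stages run back to back with no overlap; the stage names and all numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict


def end_to_end_speedup(latency_s: Dict[str, float],
                       stage_speedup: Dict[str, float]) -> float:
    """Overall speedup when each stage is accelerated by its own factor.

    Stages missing from stage_speedup are left unchanged (factor 1.0).
    """
    before = sum(latency_s.values())
    after = sum(t / stage_speedup.get(name, 1.0) for name, t in latency_s.items())
    return before / after


# Hypothetical per-stage latencies in seconds, for illustration only.
baseline = {"prefill": 4.0, "decode": 4.0, "non_llm": 2.0}

prefill_only = end_to_end_speedup(baseline, {"prefill": 2.0})
decode_only = end_to_end_speedup(baseline, {"decode": 2.0})
combined = end_to_end_speedup(baseline, {"prefill": 2.0, "decode": 2.0})

print(prefill_only, decode_only, combined)
```

Reporting the three numbers side by side, together with the measured stage latencies, would show how much of a headline figure each component contributes and how much the untouched non-LLM time caps the total.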

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments. We address the concern about the evaluation details below and will update the manuscript to provide the requested information.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: the 1.61x end-to-end speedup claim is presented as the primary result, yet the manuscript supplies no concrete description of the 'representative agentic workloads,' their selection criteria, diversity (e.g., tool-call frequency, context length distribution, multi-turn structure), exact baselines, measurement methodology (including hardware, batching, and error bars), or ablation isolating the contribution of each component. This information is load-bearing for assessing whether the speedup generalizes beyond the tested cases.

    Authors: We acknowledge that the manuscript would benefit from more explicit detail on the workloads and evaluation methodology. In the revised version, we will add a dedicated subsection to the Evaluation section that describes the representative agentic workloads, including their selection criteria (e.g., based on popular on-device agent benchmarks and real-world scenarios) and their diversity in tool-call frequency, context length, and multi-turn conversation structure. We will also specify the exact baselines (vanilla inference on the same hardware) and the measurement setup, including the target hardware (specific edge device), batch size (typically 1 for on-device), and number of runs for error bars. Finally, we will provide ablations showing the individual and combined contributions of prompt rewriting for prefix caching and LLM-free speculative decoding. This will allow readers to better assess the generalizability of the 1.61x speedup with no accuracy loss.

    revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical systems evaluation with no derivations

full rationale

The paper presents Agent-X as an empirical software framework for accelerating on-device LLM agents via prompt rewriting for prefix caching and LLM-free speculative decoding. All central claims (1.61x end-to-end speedup, no accuracy loss) are stated as direct measurements on real systems using representative workloads. No equations, parameter fittings, uniqueness theorems, or derivation chains appear in the abstract or described content. The work contains no self-referential definitions, fitted inputs renamed as predictions, or load-bearing self-citations that reduce claims to their own inputs. This is a standard empirical systems paper whose results stand or fall on experimental data rather than any closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract, the framework relies on standard LLM inference stages (prefill/decode) and existing caching mechanisms; no explicit free parameters, new axioms, or invented entities are introduced or quantified.

axioms (1)
  • domain assumption: Agent workloads exhibit reusable prefix token patterns amenable to caching.
    The prompt-rewriting component assumes such patterns exist and can be leveraged without side effects.

pith-pipeline@v0.9.0 · 5420 in / 1196 out tokens · 64411 ms · 2026-05-12T04:00:40.925350+00:00 · methodology



Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages

  1. [1]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-latency Tradeoff in LLM Inference with Sarathi-serve. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI)

  2. [2]

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. In arxiv.org

  3. [3]

    Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. LLM in a Flash: Efficient Large Language Model Inference with Limited Memory. In Proceedings of the ACL (Association for Computational Linguistics)

  4. [4]

    AMD. 2025. https://www.amd.com/en/products/processors/laptop/ryzen.html

  5. [5]

    AMD. 2025. AMD Instinct MI325X Accelerator. https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/product-briefs/instinct-mi325x-datasheet.pdf

  6. [6]

    Anthropic. 2024. https://www.anthropic.com/solutions/agents

  7. [7]

    Anthropic. 2024. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol

  8. [8]

    Apoorv Saxena. 2023. Prompt Lookup Decoding. https://github.com/apoorvumang/prompt-lookup-decoding

  9. [9]

    Apple. 2010. Siri. https://www.apple.com/siri

  10. [10]

    Apple. 2023. https://github.com/ml-explore/mlx-lm

  11. [11]

    Apple. 2024. Apple Intelligence. https://www.apple.com/apple-intelligence

  12. [12]

    Apple. 2024. Apple Introduces M4 Pro and M4 Max. https://www.apple.com/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max

  13. [13]

    Apple. 2024. Introducing Apple’s On-device and Server Foundation Models. https://machinelearning.apple.com/research/introducing-apple-foundation-models

  14. [14]

    Apple Developer. 2025. https://developer.apple.com/documentation/FoundationModels

  15. [15]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  16. [16]

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating Large Language Model Decoding with Speculative Sampling. In arxiv.org

  17. [17]

    Stanley F Chen and Joshua Goodman. 1999. An Empirical Study of Smoothing Techniques for Language Modeling. Computer Speech & Language 13, 4 (1999), 359–394

  18. [18]

    Yongheng Deng, Ziqing Qiao, Ye Zhang, Zhenya Ma, Yang Liu, and Ju Ren. CrossLM: A Data-free Collaborative Fine-tuning Framework for Large and Small Language Models. In Proceedings of the International Conference on Mobile Systems, Applications, and Services

  20. [20]

    Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. 2024. TinyAgent: Function Calling at the Edge. In arxiv.org

  21. [21]

    Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. 2025. Plan-and-Act: Improving Planning of Agents for Long-horizon Tasks. In arxiv.org

  22. [22]

    Tiantian Gan and Qiyao Sun. 2025. RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-augmented Generation. In arxiv.org

  23. [23]

    Gemini Team. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. In arxiv.org

  24. [24]

    In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-latency Inference. In Proceedings of Machine Learning and Systems (MLSYS)

  25. [25]

    Google. 2024. https://gemini.google/assistant

  26. [26]

    Google Blog. 2024. Circle (or Highlight or Scribble) to Search. https://blog.google/products/search/google-circle-to-search-android

  27. [27]

    Google Blog. 2025. A New Era of Intelligence with Gemini 3. https://blog.google/products/gemini/gemini-3/

  28. [28]

    Google Blog. 2025. Gemini CLI: Your Open-source AI Agent. https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/

  29. [29]

    Google Cloud. 2024. TPU v6e. https://cloud.google.com/tpu/docs/v6e

  30. [30]

    Google DeepMind. 2023. https://deepmind.google/models/gemini/nano

  31. [31]

    Google Developers. 2025. https://ai.google.dev/gemini-api/docs/function-calling

  32. [32]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

  33. [33]

    Yufeng Gu, Alireza Khadem, Sumanth Umesh, Ning Liang, Xavier Servot, Onur Mutlu, Ravi Iyer, and Reetuparna Das. 2025. PIM is All You Need: A CXL-enabled GPU-free System for Large Language Model Inference. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

  34. [34]

    Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert. 2023. https://github.com/ml-explore/mlx

  35. [35]

    Mingqiang Huang, Ao Shen, Kai Li, Haoxiang Peng, Boyu Li, Yupeng Su, and Hao Yu. 2025. EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models. IEEE Transactions on Circuits and Systems I: Regular Papers (2025)

  36. [36]

    Aditya K. Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. 2025. POD-attention: Unlocking Full Prefill-decode Overlap for Faster LLM Inference. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

  37. [37]

    Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. 2024. An LLM Compiler for Parallel Function Calling. In Proceedings of the International Conference on Machine Learning (ICML)

  38. [38]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM Symposium on Operating System Principles (SOSP)

  39. [39]

    Daniel D Lee and H Sebastian Seung. 1999. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature 401, 6755 (1999), 788–791

  40. [40]

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast Inference from Transformers via Speculative Decoding. In Proceedings of the International Conference on Machine Learning (ICML)

  41. [41]

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. In Proceedings of the International Conference on Machine Learning (ICML)

  42. [42]

    Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What Makes Good In-Context Examples for GPT-3?. In arxiv.org

  43. [43]

    Shiyi Liu, Haiying Shen, Shuai Che, Mahdi Ghandi, and Mingqin Li. 2025. HERA: Hybrid Edge-cloud Resource Allocation for Cost-efficient AI Agents. In arxiv.org

  44. [44]

    LM Studio. 2024. https://github.com/lmstudio-ai/mlx-engine

  45. [45]

    Manus AI. 2025. Manus. https://manus.im/

  46. [46]

    Meta. 2024. Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices

  47. [47]

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. 2024. SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification. In Proceedings ...

  48. [48]

    Microsoft. 2024. https://www.microsoft.com/en-us/windows/business/devices/copilot-plus-pcs

  49. [49]

    Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2022. MetaICL: Learning to Learn in Context. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)

  50. [50]

    NVIDIA. 2024. NVIDIA H100 Tensor Core GPU. https://resources.nvidia.com/en-us-hopper-architecture/nvidia-tensor-core-gpu-datasheet

  51. [51]

    NVIDIA. 2024. NVIDIA H200 Tensor Core GPU. https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446

  52. [52]

    NVIDIA. 2025. NVIDIA Blackwell Architecture Technical Brief. https://resources.nvidia.com/en-us-blackwell-architecture

  53. [53]

    OpenAI. 2025. Introducing ChatGPT Agent: Bridging Research and Action. https://openai.com/index/introducing-chatgpt-agent/

  54. [54]

    OpenAI. 2025. Introducing Deep Research. https://openai.com/index/introducing-deep-research/

  55. [55]

    Varatheepan Paramanayakam, Andreas Karatzas, Iraklis Anagnostopoulos, and Dimitrios Stamoulis. 2025. Less is More: Optimizing Function Calling for LLM Execution on Edge Devices. In Proceedings of the Design, Automation and Test in Europe Conference (DATE)

  56. [56]

    Yeonhong Park, Jake Hyun, Hojoon Kim, and Jae W. Lee. 2025. DecDEC: A Systems Approach to Advancing Low-bit LLM Quantization. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI)

  57. [57]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In Proceedings of the International Symposium on Computer Architecture (ISCA)

  58. [58]

    Qualcomm. 2024. https://www.qualcomm.com/products/mobile/snapdragon/laptops-and-tablets/snapdragon-x-elite

  59. [59]

    Qualcomm. 2024. Hexagon NPU SDK. https://www.qualcomm.com/developer/software/hexagon-npu-sdk

  60. [60]

    Samsung. 2017. https://www.samsung.com/us/apps/bixby

  61. [61]

    Rishov Sarkar, Hanxue Liang, Zhiwen Fan, Zhangyang Wang, and Cong Hao. Edge-MoE: Memory-efficient Multi-task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts. In Proceedings of the International Conference on Computer-Aided Design

  63. [63]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS)

  64. [64]

    Seong Hoon Seo, Junghoon Kim, Donghyun Lee, Seonah Yoo, Seokwon Moon, Yeonhong Park, and Jae W. Lee. 2025. FACIL: Flexible DRAM Address Mapping for SoC-PIM Cooperative On-device LLM Inference. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA)

  65. [65]

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS)

  66. [66]

    Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, and Ang Li. 2025. EdgeLoRA: An Efficient Multi-tenant LLM Serving System on Edge Devices. In Proceedings of the International Conference on Mobile Systems, Applications, and Services

  67. [67]

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS)

  68. [68]

    Simranjit Singh, Andreas Karatzas, Michael Fore, Iraklis Anagnostopoulos, and Dimitrios Stamoulis. 2024. An LLM-tool Compiler for Fused Parallel Function Calling. In arxiv.org

  69. [69]

    Squeeze AI Lab. 2024. TinyAgent-7B. https://huggingface.co/squeeze-ai-lab/TinyAgent-7B

  70. [70]

    Squeeze AI Lab. 2024. TinyAgent-dataset. https://huggingface.co/datasets/squeeze-ai-lab/TinyAgent-dataset

  71. [71]

    Squeeze AI Lab. 2024. TinyAgent-ToolRAG. https://huggingface.co/squeeze-ai-lab/TinyAgent-ToolRAG

  72. [72]

    Chunlin Tian, Xinpeng Qin, Kahou Tam, Li Li, Zijian Wang, Yuanzhe Zhao, Minglei Zhang, and Chengzhong Xu. 2025. CLONE: Customizing LLMs for Efficient Latency-aware Inference at the Edge. In Proceedings of the USENIX Annual Technical Conference (ATC)

  73. [73]

    Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, and David Harel. 2025. Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies. In Proceedings of the International Conference on Machine Learning (ICML)

  74. [74]

    Haoming Wang, Boyuan Yang, Xiangyu Yin, and Wei Gao. 2025. Never Start from Scratch: Expediting On-device LLM Personalization via Explainable Model Selection. In Proceedings of the International Conference on Mobile Systems, Applications, and Services

  75. [75]

    Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-agent Collaboration. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS)

  76. [76]

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-agent: Autonomous Multi-modal Mobile Device Agent with Visual Perception. In arxiv.org

  77. [77]

    WizardLM Team. 2024. WizardLM 2. https://wizardlm.github.io/WizardLM2

  78. [78]

    Zheng Xu, Dehao Kong, Jiaxin Liu, Jinxi Li, Jingxiang Hou, Xu Dai, Chao Li, Shaojun Wei, Yang Hu, and Shouyi Yin. 2025. WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips. In Proceedings of the International Symposium on Computer Architecture (ISCA)

  79. [79]

    Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. In Proceedings of the European Conference on Computer Systems (EuroSys)

  80. [80]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: Synergizing Reasoning and Acting in Language Models. In Proceedings of the International Conference on Learning Representations (ICLR)

Showing first 80 references.