Agent-X: Full Pipeline Acceleration of On-device AI Agents

Byeongjun Shin; Jiin Kim; Jinha Chung; Minsoo Rhu

arxiv: 2605.10380 · v1 · submitted 2026-05-11 · 💻 cs.AI

Pith reviewed 2026-05-12 04:00 UTC · model grok-4.3

classification 💻 cs.AI

keywords on-device AI agents · LLM acceleration · prefix caching · speculative decoding · edge latency · agentic workloads · prefill optimization · decode acceleration

The pith

Prompt rewriting for prefix caching and LLM-free speculative decoding accelerate on-device AI agents by 1.61x end-to-end with no accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based agents achieve strong task performance yet face high end-to-end latency on edge devices because both the prefill and decode phases consume significant time. Agent-X supplies a software-only framework that rewrites prompts to exploit prefix caching matched to agent-specific input patterns and adds LLM-free speculative decoding to generate tokens faster with little added cost. On representative agentic workloads this yields a measured 1.61x end-to-end speedup on real systems while accuracy stays unchanged. The method is designed to plug into existing on-device agents and targets the main latency bottlenecks across the full pipeline. If the gains hold, agentic applications become more practical on phones and other constrained devices without hardware redesign or accuracy penalties.

Core claim

Agent-X accelerates both the prefill and decode stages of on-device agent workloads. Prompt rewriting enables prefix caching on agent input-token patterns, and LLM-free speculative decoding produces tokens rapidly at minimal overhead. Together they deliver a 1.61x end-to-end speedup on representative workloads with no accuracy loss and integrate seamlessly into existing agents.
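To make the relationship between per-stage gains and an end-to-end figure concrete, here is a minimal Amdahl-style sketch. The stage names, latency shares, and per-stage speedups are hypothetical values chosen only to illustrate the arithmetic; they are not measurements from the paper.

```python
def end_to_end_speedup(stage_fraction, stage_speedup):
    """Amdahl-style estimate of overall speedup from per-stage speedups.

    stage_fraction: share of baseline latency spent in each stage (sums to 1).
    stage_speedup:  speedup factor per stage; missing stages are unchanged.
    """
    if abs(sum(stage_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("stage fractions must sum to 1")
    # New latency is the sum of each stage's share divided by its speedup.
    new_time = sum(frac / stage_speedup.get(stage, 1.0)
                   for stage, frac in stage_fraction.items())
    return 1.0 / new_time


# Hypothetical latency shares and speedups (illustration only).
fractions = {"prefill": 0.4, "decode": 0.4, "non_llm": 0.2}
speedups = {"prefill": 2.0, "decode": 2.0}
overall = end_to_end_speedup(fractions, speedups)
print(f"{overall:.2f}x")
```

The sketch shows why both stages matter: with these assumed shares, doubling only one LLM stage would cap the overall gain well below what doubling both achieves, because the untouched stages dominate the remaining latency.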

What carries the argument

Prompt rewriting tailored to agent-specific input-token patterns for prefix caching combined with LLM-free speculative decoding for token generation.
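The idea that prompt layout determines how much of a prompt can be served from a prefix cache can be sketched as follows. This is a simplified illustration, not the paper's PromptWeaver: the whitespace tokenizer, the `<tool:...>` markers, the tool names, and the alphabetical "canonical" ordering are all assumptions made for the example.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(system, tools, query, canonical=False):
    """Assemble a prompt as a token list (whitespace tokens for simplicity).

    With canonical=True the tool entries are sorted, so two requests that
    retrieve overlapping tools place them in the same order.
    """
    ordered = sorted(tools) if canonical else list(tools)
    text = " ".join([system] + [f"<tool:{t}>" for t in ordered] + [query])
    return text.split()


SYSTEM = "You are a planner."
# Two hypothetical requests whose retrieved tools overlap but arrive in
# different (relevance-ranked) orders.
req_a = (["send_email", "create_event", "open_map"], "Email Bob the address")
req_b = (["open_map", "create_event", "web_search"], "Find a cafe and add it")

for canonical in (False, True):
    pa = build_prompt(SYSTEM, *req_a, canonical=canonical)
    pb = build_prompt(SYSTEM, *req_b, canonical=canonical)
    print(canonical, common_prefix_len(pa, pb))
```

In the unsorted layout the two prompts diverge right after the system text, so only those tokens are reusable; with a stable ordering the shared tool entries extend the reusable prefix. A real system would compare model token IDs and reuse the corresponding KV-cache entries rather than count whitespace tokens.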

Load-bearing premise

The selected representative agentic workloads capture real-world on-device usage patterns and the two optimizations generalize across models and tasks without introducing hidden accuracy or latency trade-offs.

What would settle it

Running the same end-to-end latency and accuracy measurements on a broader collection of agentic tasks or additional LLM models and observing either no speedup or any accuracy drop would disprove the central claim.

Figures

Figures reproduced from arXiv: 2605.10380 by Byeongjun Shin, Jiin Kim, Jinha Chung, Minsoo Rhu.

**Figure 1.** Overview of an agentic system. Surrounding text: "… due to the resource-constrained nature of edge computing. Unlike cloud-based LLMs whose primary performance bottleneck lies in the decode stage, this paper makes the key observation that on-device agents spend a significant amount of time in both the prefill and decode stages. This key insight underscores the need for full-system acceleration techniques that address all key c…"

**Figure 2.** Structure of plan-out agents. The whole pipeline consists of two LLMs (Planner and Arbiter) and a series of tool calls.

**Figure 4.** Illustration of applying prefix caching. Even though…

**Figure 3.** Illustration of the ToolRAG process. Surrounding text: "TinyAgent [19] is an open-source macOS agent that fine-tunes LLMs for agentic tasks. It also introduces ToolRAG (Tool Retrieval Augmented Generation) [69] for efficient tool and example selection. Tool choice and tool-use examples. ToolRAG comprises three components: (i) offline preparation of a tool-use example database, (ii) runtime tool retrieval, and (iii) tool-use…"

**Figure 5.** Illustration of speculative decoding. Plot labels captured with the caption: x-axis "Latency (seconds)"; series Planner prefill, Planner decode, Arbiter prefill, Arbiter decode, Non-LLM.

**Figure 6.** Latency breakdown of agentic tasks. Surrounding text: "… tokens refer to portions of prompts that prefix caching can be applied to with a single static prompt, and uncacheable tokens refer to portions of prompts that cannot benefit from prefix caching due to an early token mismatch. Speculative decoding. An LLM's decode stage is memory bandwidth-bound because of its autoregressive, one-token-at-a-time nature, yielding low comp…"

**Figure 8.** Tool co-activation heatmap of a subset of tools in…

**Figure 11.** Overview of PromptWeaver, divided into the of…

**Figure 12.** Illustration of PromptWeaver's offline KV cache precompute mechanism.

**Figure 13.** Overview of ExSpec with trigram (n = 3) LUT and draft token generation length of 4. Surrounding text: "… append single-tool examples of activated tools, which we observed to be the most popular form of tool-use examples, to the end of the clustered examples. We also add top-K (0 ≤ K ≤ 4) relevant examples from ToolRAG (…"

**Figure 15.** Tool-use example coverage (left, red) and total…

**Figure 16.** Prefill stage latency (left) and speedup gain (right)…

**Figure 18.** End-to-end latency (left) and speedup (right) of…

**Figure 19.** Decode latency (left) and speedup (right) of ExSpec…
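The Figure 13 caption mentions a trigram lookup table (LUT) and a draft length of 4. A minimal sketch of LLM-free drafting with greedy verification is given below. It is a generic n-gram lookup draft in the spirit of prompt lookup decoding [8], not the authors' ExSpec: the LUT keying (previous two tokens map to the next token), the `toy_target` stand-in for the LLM, and the example token sequences are assumptions for illustration.

```python
def build_lut(tokens, n=3):
    """Map each (n-1)-token context in `tokens` to the token that followed it
    (the last occurrence wins). Assumes n >= 2."""
    lut = {}
    for i in range(len(tokens) - n + 1):
        lut[tuple(tokens[i:i + n - 1])] = tokens[i + n - 1]
    return lut


def draft(lut, context, n=3, max_len=4):
    """Propose up to `max_len` tokens by chaining LUT lookups (no LLM call)."""
    ctx = list(context)
    out = []
    while len(out) < max_len:
        nxt = lut.get(tuple(ctx[-(n - 1):]))
        if nxt is None:
            break
        out.append(nxt)
        ctx.append(nxt)
    return out


def verify(target_next, context, drafted):
    """Greedy verification: keep drafted tokens while they match the target,
    then emit the target's own token at the first mismatch (or one bonus token
    after full acceptance). A real system checks all drafts in one batched
    forward pass; this toy calls the target sequentially."""
    ctx = list(context)
    accepted = []
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))
    return accepted


# Hypothetical tool-use example from the prompt, and the output the toy
# target "model" will produce.
PROMPT = "call send_email ( to = alice ) then call open_note ( title = todo )".split()
REFERENCE = "call send_email ( to = bob ) then call open_map ( query = cafe )".split()


def toy_target(ctx):
    """Stand-in for the LLM: replays REFERENCE token by token."""
    return REFERENCE[len(ctx)] if len(ctx) < len(REFERENCE) else "<eos>"


lut = build_lut(PROMPT, n=3)
context = REFERENCE[:2]
drafted = draft(lut, context, n=3, max_len=4)
emitted = verify(toy_target, context, drafted)
print(drafted)
print(emitted)
```

Because verification only accepts tokens the target would have produced anyway, the emitted sequence matches plain greedy decoding; the benefit is that several tokens can be confirmed per target step when the structured output resembles text already in the prompt.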

Original abstract

LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X's two key components rewrite prompts to leverage prefix caching tailored to agent-specific input-token patterns and enable LLM-free speculative decoding for fast token generation with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61x end-to-end speedup in real systems with no accuracy loss and can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, ours is the first to systematically characterize and eliminate latency bottlenecks in on-device agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Agent-X combines prompt rewriting for prefix caching with LLM-free speculative decoding to claim a 1.61x real-system speedup on on-device agents, but the result hinges on uncharacterized workloads that may not reflect typical usage.

The main thing to know is that this paper gives a practical, software-only way to cut latency in on-device LLM agents by rewriting prompts to hit prefix caches more often and using speculative decoding that skips the LLM for draft tokens. They report 1.61x end-to-end speedup with zero accuracy loss on their test cases, and the approach looks easy to drop into existing agent code.

That combination, tuned to agent token patterns like tool calls and multi-turn context, is the concrete new piece; prior work had the individual tricks but not this agent-focused pipeline treatment plus the latency breakdown for prefill and decode stages on edge hardware. They earn credit for staying grounded in real-system runs instead of simulation and for keeping the changes non-invasive.

The soft spot is the evaluation. The speedup rests on unspecified representative agentic workloads, and the abstract gives no selection criteria, diversity stats, or ablations across models, context lengths, or task mixes. If those workloads over-sample cases with high cache reuse or high speculation acceptance, the 1.61x number will not travel. The stress-test note on unverified coverage is on target here; without those details the central claim stays hard to assess.

This paper is for systems researchers and engineers who deploy agents on phones or embedded devices and need immediate latency wins. A reader already working on mobile inference or agent runtimes will pick up usable implementation ideas even if the exact speedup needs verification. It deserves a serious referee to demand the workload definitions, baseline comparisons, and robustness checks. I would send it to peer review with those requests rather than desk reject.

Referee Report

1 major / 1 minor

Summary. The paper introduces Agent-X, a software-only, accuracy-preserving framework for accelerating on-device LLM-based agents. Its two core techniques are prompt rewriting to exploit prefix caching on agent-specific token patterns and LLM-free speculative decoding for low-overhead token generation. The central empirical claim is a 1.61x end-to-end real-system speedup on representative agentic workloads with zero accuracy loss, plus seamless integration into existing agents; the work positions itself as the first systematic characterization and elimination of prefill and decode latency bottlenecks in on-device agents.

Significance. If the reported speedup and accuracy preservation hold under broader conditions, the contribution would be practically significant for edge deployment of agentic systems, where end-to-end latency is a primary barrier. The software-only design and focus on both prefill and decode stages are strengths that could enable immediate adoption without hardware changes. The absence of parameter fitting or circular derivations in the performance numbers is a positive feature of the empirical approach.

major comments (1)

[Abstract and Evaluation] Abstract and Evaluation section: the 1.61x end-to-end speedup claim is presented as the primary result, yet the manuscript supplies no concrete description of the 'representative agentic workloads,' their selection criteria, diversity (e.g., tool-call frequency, context length distribution, multi-turn structure), exact baselines, measurement methodology (including hardware, batching, and error bars), or ablation isolating the contribution of each component. This information is load-bearing for assessing whether the speedup generalizes beyond the tested cases.

minor comments (1)

[Abstract] The abstract states the techniques 'can be seamlessly integrated' but provides no concrete integration examples or overhead measurements for the prompt-rewriting and speculative-decoding modules themselves.
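The measurement details the referee asks for (repeated runs, error bars, per-component ablation) can be sketched with a small harness. The configuration names and latency samples below are hypothetical placeholders, not results from the paper; `time_runs` would wrap whatever end-to-end agent invocation is being measured.

```python
import statistics
import time


def time_runs(fn, runs=5):
    """Wall-clock latency of `fn` in seconds over `runs` repetitions."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Mean and sample standard deviation (the error bar) of latency samples."""
    return statistics.mean(samples), statistics.stdev(samples)


def ablation_table(baseline, variants):
    """Speedup of each variant's mean latency relative to the baseline mean."""
    base_mean, _ = summarize(baseline)
    return {name: base_mean / summarize(s)[0] for name, s in variants.items()}


# Hypothetical latency samples in seconds (illustration only).
baseline = [10.0, 11.0, 9.0]
variants = {
    "prompt_rewriting_only": [8.0, 8.5, 7.5],
    "spec_decoding_only": [8.0, 9.0, 7.0],
    "both": [5.0, 5.5, 4.5],
}
table = ablation_table(baseline, variants)
for name, speedup in table.items():
    print(f"{name}: {speedup:.2f}x")
```

Reporting each component alone next to the combined configuration is what lets a reader see how much of an end-to-end number comes from prefill savings versus decode savings.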

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments. We address the concern about the evaluation details below and will update the manuscript to provide the requested information.

Referee: [Abstract and Evaluation] Abstract and Evaluation section: the 1.61x end-to-end speedup claim is presented as the primary result, yet the manuscript supplies no concrete description of the 'representative agentic workloads,' their selection criteria, diversity (e.g., tool-call frequency, context length distribution, multi-turn structure), exact baselines, measurement methodology (including hardware, batching, and error bars), or ablation isolating the contribution of each component. This information is load-bearing for assessing whether the speedup generalizes beyond the tested cases.

Authors: We acknowledge that the current manuscript could benefit from more explicit details on the workloads and evaluation methodology to strengthen the claims. In the revised version, we will add a dedicated subsection in the Evaluation section describing the representative agentic workloads in detail, including their selection criteria (e.g., based on popular on-device agent benchmarks and real-world scenarios) and diversity aspects such as varying tool-call frequencies, context lengths, and multi-turn conversation structures. We will also specify the exact baselines (vanilla inference on the same hardware) and the measurement setup, including the target hardware (specific edge device), batch size (typically 1 for on-device), and number of runs for error bars, and we will provide ablations showing the individual and combined contributions of prompt rewriting for prefix caching and LLM-free speculative decoding. This will allow readers to better assess the generalizability of the 1.61x speedup with no accuracy loss.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical systems evaluation with no derivations

The paper presents Agent-X as an empirical software framework for accelerating on-device LLM agents via prompt rewriting for prefix caching and LLM-free speculative decoding. All central claims (1.61x end-to-end speedup, no accuracy loss) are stated as direct measurements on real systems using representative workloads. No equations, parameter fittings, uniqueness theorems, or derivation chains appear in the abstract or described content. The work contains no self-referential definitions, fitted inputs renamed as predictions, or load-bearing self-citations that reduce claims to their own inputs. This is a standard empirical systems paper whose results stand or fall on experimental data rather than any closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract, the framework relies on standard LLM inference stages (prefill/decode) and existing caching mechanisms; no explicit free parameters, new axioms, or invented entities are introduced or quantified.

axioms (1)

domain assumption: Agent workloads exhibit reusable prefix token patterns amenable to caching.
The prompt-rewriting component assumes such patterns exist and can be leveraged without side effects.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "PromptWeaver reconstructs the input prompt to enable efficient prefix caching... ExSpec introduces a lightweight, prompt-aware draft model that enables efficient speculative decoding at the edge."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We use 1,022 examples from the TinyAgent fine-tuning test dataset... tool co-activation locality... NMF clustering"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages

[1]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-latency Tradeoff in LLM Inference with Sarathi-serve. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI)

[2]

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. In arxiv.org

[3]

Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. LLM in a Flash: Efficient Large Language Model Inference with Limited Memory. In Proceedings of the ACL (Association for Computational Linguistics)

[4]

AMD. 2025. https://www.amd.com/en/products/processors/laptop/ryzen.html

[5]

AMD. 2025. AMD Instinct MI325X Accelerator. https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/product-briefs/instinct-mi325x-datasheet.pdf

[6]

Anthropic. 2024. https://www.anthropic.com/solutions/agents

[7]

Anthropic. 2024. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol

[8]

Apoorv Saxena. 2023. Prompt Lookup Decoding. https://github.com/apoorvumang/prompt-lookup-decoding

[9]

Apple. 2010. Siri. https://www.apple.com/siri

[10]

Apple. 2023. https://github.com/ml-explore/mlx-lm

[11]

Apple. 2024. Apple Intelligence. https://www.apple.com/apple-intelligence

[12]

Apple. 2024. Apple Introduces M4 Pro and M4 Max. https://www.apple.com/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max

[13]

Apple. 2024. Introducing Apple's On-device and Server Foundation Models. https://machinelearning.apple.com/research/introducing-apple-foundation-models

[14]

Apple Developer. 2025. https://developer.apple.com/documentation/FoundationModels

[15]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

[16]

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating Large Language Model Decoding with Speculative Sampling. In arxiv.org

[17]

Stanley F Chen and Joshua Goodman. 1999. An Empirical Study of Smoothing Techniques for Language Modeling. Computer Speech & Language 13, 4 (1999), 359–394

[18]

Yongheng Deng, Ziqing Qiao, Ye Zhang, Zhenya Ma, Yang Liu, and Ju Ren. CrossLM: A Data-free Collaborative Fine-tuning Framework for Large and Small Language Models. In Proceedings of the International Conference on Mobile Systems, Applications, and Services

[20]

Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. 2024. TinyAgent: Function Calling at the Edge. In arxiv.org

[21]

Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. 2025. Plan-and-Act: Improving Planning of Agents for Long-horizon Tasks. In arxiv.org

[22]

Tiantian Gan and Qiyao Sun. 2025. RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-augmented Generation. In arxiv.org

[23]

Gemini Team. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. In arxiv.org

[24]

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-latency Inference. In Proceedings of Machine Learning and Systems (MLSYS)

[25]

Google. 2024. https://gemini.google/assistant

[26]

Google Blog. 2024. Circle (or Highlight or Scribble) to Search. https://blog.google/products/search/google-circle-to-search-android

[27]

Google Blog. 2025. A New Era of Intelligence with Gemini 3. https://blog.google/products/gemini/gemini-3/

[28]

Google Blog. 2025. Gemini CLI: Your Open-source AI Agent. https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/

[29]

Google Cloud. 2024. TPU v6e. https://cloud.google.com/tpu/docs/v6e

[30]

Google DeepMind. 2023. https://deepmind.google/models/gemini/nano

[31]

Google Developers. 2025. https://ai.google.dev/gemini-api/docs/function-calling

[32]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

[33]

Yufeng Gu, Alireza Khadem, Sumanth Umesh, Ning Liang, Xavier Servot, Onur Mutlu, Ravi Iyer, and Reetuparna Das. 2025. PIM is All You Need: A CXL-enabled GPU-free System for Large Language Model Inference. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

[34]

Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert. 2023. https://github.com/ml-explore/mlx

[35]

Mingqiang Huang, Ao Shen, Kai Li, Haoxiang Peng, Boyu Li, Yupeng Su, and Hao Yu. 2025. EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models. IEEE Transactions on Circuits and Systems I: Regular Papers (2025)

[36]

Aditya K. Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. 2025. POD-attention: Unlocking Full Prefill-decode Overlap for Faster LLM Inference. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

[37]

Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. 2024. An LLM Compiler for Parallel Function Calling. In Proceedings of the International Conference on Machine Learning (ICML)

[38]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM Symposium on Operating System Principles (SOSP)

[39]

Daniel D Lee and H Sebastian Seung. 1999. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature 401, 6755 (1999), 788–791

[40]

Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast Inference from Transformers via Speculative Decoding. In Proceedings of the International Conference on Machine Learning (ICML)

[41]

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. In Proceedings of the International Conference on Machine Learning (ICML)

[42]

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What Makes Good In-Context Examples for GPT-3?. In arxiv.org

[43]

Shiyi Liu, Haiying Shen, Shuai Che, Mahdi Ghandi, and Mingqin Li. 2025. HERA: Hybrid Edge-cloud Resource Allocation for Cost-efficient AI Agents. In arxiv.org

[44]

LM Studio. 2024. https://github.com/lmstudio-ai/mlx-engine

[45]

Manus AI. 2025. Manus. https://manus.im/

[46]

Meta. 2024. Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices

[47]

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. 2024. SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification. In Proceedings ...

[48]

Microsoft. 2024. https://www.microsoft.com/en-us/windows/business/devices/copilot-plus-pcs

[49]

Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2022. MetaICL: Learning to Learn in Context. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)

[50]

NVIDIA. 2024. NVIDIA H100 Tensor Core GPU. https://resources.nvidia.com/en-us-hopper-architecture/nvidia-tensor-core-gpu-datasheet

[51]

NVIDIA. 2024. NVIDIA H200 Tensor Core GPU. https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446

[52]

NVIDIA. 2025. NVIDIA Blackwell Architecture Technical Brief. https://resources.nvidia.com/en-us-blackwell-architecture

[53]

OpenAI. 2025. Introducing ChatGPT Agent: Bridging Research and Action. https://openai.com/index/introducing-chatgpt-agent/

[54]

OpenAI. 2025. Introducing Deep Research. https://openai.com/index/introducing-deep-research/

[55]

Varatheepan Paramanayakam, Andreas Karatzas, Iraklis Anagnostopoulos, and Dimitrios Stamoulis. 2025. Less is More: Optimizing Function Calling for LLM Execution on Edge Devices. In Proceedings of the Design, Automation and Test in Europe Conference (DATE)

[56]

Yeonhong Park, Jake Hyun, Hojoon Kim, and Jae W. Lee. 2025. DecDEC: A Systems Approach to Advancing Low-bit LLM Quantization. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI)

[57]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In Proceedings of the International Symposium on Computer Architecture (ISCA)

[58]

Qualcomm. 2024. https://www.qualcomm.com/products/mobile/snapdragon/laptops-and-tablets/snapdragon-x-elite

[59]

Qualcomm. 2024. Hexagon NPU SDK. https://www.qualcomm.com/developer/software/hexagon-npu-sdk

[60]

Samsung. 2017. https://www.samsung.com/us/apps/bixby

[61]

Rishov Sarkar, Hanxue Liang, Zhiwen Fan, Zhangyang Wang, and Cong Hao. Edge-MoE: Memory-efficient Multi-task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts. In Proceedings of the International Conference on Computer-Aided Design

[63]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. InProceed- ings of the International Conference on Neural Information Processing Systems (NeurIPS)

work page 2023
[64]

Seong Hoon Seo, Junghoon Kim, Donghyun Lee, Seonah Yoo, Seokwon Moon, Yeonhong Park, and Jae W. Lee. 2025. FACIL: Flexible DRAM Address Map- ping for SoC-PIM Cooperative On-device LLM Inference. InProceedings of the International Symposium on High-Performance Computer Architecture (HPCA)

work page 2025
[65]

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. InProceedings of the International Conference on Neural Information Processing Systems (NeurIPS)

work page 2023
[66]

Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, and Ang Li. 2025. EdgeLoRA: An Efficient Multi-tenant LLM Serving System on Edge Devices. InProceedings of the International Conference on Mobile Systems, Jinha Chung, Byeongjun Shin, Jiin Kim, and Minsoo Rhu Applications, and Services

work page 2025
[67]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. InProceedings of the International Conference on Neural Information Processing Systems (NeurIPS)

work page 2023
[68]

Simranjit Singh, Andreas Karatzas, Michael Fore, Iraklis Anagnostopoulos, and Dimitrios Stamoulis. 2024. An LLM-tool Compiler for Fused Parallel Function Calling. Inarxiv.org

work page 2024
[69]

Squeeze AI Lab. 2024. TinyAgent-7B. https://huggingface.co/squeeze-ai-lab/ TinyAgent-7B

work page 2024
[70]

Squeeze AI Lab. 2024. TinyAgent-dataset. https://huggingface.co/datasets/ squeeze-ai-lab/TinyAgent-dataset

work page 2024
[71]

Squeeze AI Lab. 2024. TinyAgent-ToolRAG. https://huggingface.co/squeeze-ai- lab/TinyAgent-ToolRAG

work page 2024
[72]

Chunlin Tian, Xinpeng Qin, Kahou Tam, Li Li, Zijian Wang, Yuanzhe Zhao, Minglei Zhang, and Chengzhong Xu. 2025. CLONE: Customizing LLMs for Efficient Latency-aware Inference at the Edge. InProceedings of the USENIX Annual Technical Conference (ATC)

work page 2025
[73]

Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, and David Harel. 2025. Accelerating LLM In- ference with Lossless Speculative Decoding Algorithms for Heterogeneous Vo- cabularies. InProceedings of the International Conference on Machine Learning (ICML)

work page 2025
[74]

Haoming Wang, Boyuan Yang, Xiangyu Yin, and Wei Gao. 2025. Never Start from Scratch: Expediting On-device LLM Personalization via Explainable Model Selec- tion. InProceedings of the International Conference on Mobile Systems, Applications, and Services

work page 2025
[75]

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-agent Collaboration. InProceedings of the International Conference on Neural Information Processing Systems (NeurIPS)

work page 2024
[76]

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-agent: Autonomous Multi-modal Mobile Device Agent with Visual Perception. Inarxiv.org

work page 2024
[77]

WizardLM Team. 2024. WizardLM 2. https://wizardlm.github.io/WizardLM2

work page 2024
[78]

Zheng Xu, Dehao Kong, Jiaxin Liu, Jinxi Li, Jingxiang Hou, Xu Dai, Chao Li, Shaojun Wei, Yang Hu, and Shouyi Yin. 2025. WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips. InProceedings of the International Symposium on Computer Architecture (ISCA)

work page 2025
[79]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. InProceedings of the European Conference on Computer Systems (EuroSys)

work page 2025
[80]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: Synergizing Reasoning and Acting in Language Models. InProceedings of the International Conference on Learning Representations (ICLR)


