Agent-X: Full Pipeline Acceleration of On-device AI Agents
Pith reviewed 2026-05-12 04:00 UTC · model grok-4.3
The pith
Prompt rewriting for prefix caching and LLM-free speculative decoding accelerate on-device AI agents by 1.61x end-to-end with no accuracy loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agent-X accelerates both the prefill and decode stages of on-device agent workloads. Prompt rewriting enables prefix caching on agent input-token patterns, and LLM-free speculative decoding produces tokens rapidly at minimal overhead. Together they deliver a 1.61x end-to-end speedup on representative workloads with no accuracy loss, and they integrate seamlessly into existing agents.
What carries the argument
Prompt rewriting tailored to agent-specific input-token patterns for prefix caching combined with LLM-free speculative decoding for token generation.
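To make "LLM-free speculative decoding" concrete, here is a minimal sketch of one well-known variant: the draft tokens come from an n-gram lookup over the tokens already in the context, and the target model only verifies them. This is an illustration under stated assumptions, not the paper's drafter. The names `draft_from_prompt`, `speculative_step`, `generate`, and the `target_next` callback are hypothetical, and `target_next` stands in for a greedy target model. A real engine would score all draft positions in one batched forward pass rather than one call per position.

```python
from typing import Callable, List, Sequence, Tuple


def draft_from_prompt(tokens: Sequence[int], ngram: int = 2,
                      max_draft: int = 4) -> List[int]:
    """Propose draft tokens without any model: find the most recent earlier
    occurrence of the trailing n-gram and copy the tokens that followed it."""
    if len(tokens) < ngram:
        return []
    tail = list(tokens[-ngram:])
    # Scan right to left over start positions before the trailing n-gram.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            return list(tokens[start + ngram:start + ngram + max_draft])
    return []


def speculative_step(tokens: List[int],
                     target_next: Callable[[Sequence[int]], int],
                     ngram: int = 2, max_draft: int = 4) -> List[int]:
    """One verification round. Every returned token is the target's own greedy
    choice, so the output matches plain greedy decoding exactly."""
    draft = draft_from_prompt(tokens, ngram, max_draft)
    context = list(tokens)
    accepted: List[int] = []
    for guess in draft:
        token = target_next(context)   # hypothetical stand-in for the target model
        accepted.append(token)
        context.append(token)
        if token != guess:             # first mismatch: keep the target's token, stop
            return accepted
    # Draft fully accepted (or empty): the target still contributes one token.
    accepted.append(target_next(context))
    return accepted


def generate(prompt: Sequence[int],
             target_next: Callable[[Sequence[int]], int],
             max_new: int) -> Tuple[List[int], int]:
    """Decode max_new tokens and count verification rounds."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new:
        tokens.extend(speculative_step(tokens, target_next))
        rounds += 1
    return tokens[:len(prompt) + max_new], rounds


if __name__ == "__main__":
    # Toy target: a repeating 0..4 cycle, so the context predicts the output well.
    toy_target = lambda ctx: (ctx[-1] + 1) % 5
    out, rounds = generate([0, 1, 2, 3, 4, 0, 1], toy_target, max_new=10)
    print(out, rounds)
```

The design point the sketch is meant to show is that a wrong draft costs only wasted verification work, never a wrong token, which is why this family of methods can be accuracy-preserving. How much it helps depends on how often agent outputs repeat spans already present in the context.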
Load-bearing premise
The selected representative agentic workloads capture real-world on-device usage patterns and the two optimizations generalize across models and tasks without introducing hidden accuracy or latency trade-offs.
What would settle it
Running the same end-to-end latency and accuracy measurements on a broader collection of agentic tasks or additional LLM models and observing either no speedup or any accuracy drop would disprove the central claim.
Original abstract
LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X's two key components rewrite prompts to leverage prefix caching tailored to agent-specific input-token patterns and enable LLM-free speculative decoding for fast token generation with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61x end-to-end speedup in real systems with no accuracy loss and can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, ours is the first to systematically characterize and eliminate latency bottlenecks in on-device agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agent-X, a software-only, accuracy-preserving framework for accelerating on-device LLM-based agents. Its two core techniques are prompt rewriting to exploit prefix caching on agent-specific token patterns and LLM-free speculative decoding for low-overhead token generation. The central empirical claim is a 1.61x end-to-end real-system speedup on representative agentic workloads with zero accuracy loss, plus seamless integration into existing agents; the work positions itself as the first systematic characterization and elimination of prefill and decode latency bottlenecks in on-device agents.
Significance. If the reported speedup and accuracy preservation hold under broader conditions, the contribution would be practically significant for edge deployment of agentic systems, where end-to-end latency is a primary barrier. The software-only design and focus on both prefill and decode stages are strengths that could enable immediate adoption without hardware changes. The absence of parameter fitting or circular derivations in the performance numbers is a positive feature of the empirical approach.
major comments (1)
- [Abstract and Evaluation] Abstract and Evaluation section: the 1.61x end-to-end speedup claim is presented as the primary result, yet the manuscript supplies no concrete description of the 'representative agentic workloads,' their selection criteria, diversity (e.g., tool-call frequency, context length distribution, multi-turn structure), exact baselines, measurement methodology (including hardware, batching, and error bars), or ablation isolating the contribution of each component. This information is load-bearing for assessing whether the speedup generalizes beyond the tested cases.
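One way to see why a per-component ablation matters is the standard Amdahl-style decomposition of end-to-end latency, written out below. The symbols are assumptions introduced here for illustration and do not come from the paper: $f_p$ and $f_d$ are the fractions of baseline latency spent in prefill and decode, and $s_p$ and $s_d$ are the per-stage speedups.

```latex
% Baseline latency split into prefill, decode, and everything else (tool calls, I/O).
T_{\text{base}} = T_p + T_d + T_o,
\qquad f_p = \frac{T_p}{T_{\text{base}}},
\quad f_d = \frac{T_d}{T_{\text{base}}}

% End-to-end speedup when prefill runs s_p times faster and decode s_d times faster.
S = \frac{T_{\text{base}}}{\dfrac{T_p}{s_p} + \dfrac{T_d}{s_d} + T_o}
  = \frac{1}{\dfrac{f_p}{s_p} + \dfrac{f_d}{s_d} + \left(1 - f_p - f_d\right)}
```

Under this reading, the same end-to-end figure can arise from very different combinations of $f_p$, $f_d$, $s_p$, and $s_d$, so a single headline number does not reveal which component carries the gain or how it would shift on workloads with a different prefill-to-decode ratio.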
minor comments (1)
- [Abstract] The abstract states the techniques 'can be seamlessly integrated' but provides no concrete integration examples or overhead measurements for the prompt-rewriting and speculative-decoding modules themselves.
Simulated Author's Rebuttal
We thank the referee for their insightful comments. We address the concern about the evaluation details below and will update the manuscript to provide the requested information.
- Referee: [Abstract and Evaluation] Abstract and Evaluation section: the 1.61x end-to-end speedup claim is presented as the primary result, yet the manuscript supplies no concrete description of the 'representative agentic workloads,' their selection criteria, diversity (e.g., tool-call frequency, context length distribution, multi-turn structure), exact baselines, measurement methodology (including hardware, batching, and error bars), or ablation isolating the contribution of each component. This information is load-bearing for assessing whether the speedup generalizes beyond the tested cases.
  Authors: We acknowledge that the manuscript would benefit from more explicit detail on the workloads and evaluation methodology. In the revised version, we will add a dedicated subsection to the Evaluation section describing the representative agentic workloads, including their selection criteria (e.g., based on popular on-device agent benchmarks and real-world scenarios) and their diversity in tool-call frequency, context length, and multi-turn conversation structure. We will also specify the exact baselines (vanilla inference on the same hardware) and the measurement setup, including the target hardware (specific edge device), batch size (typically 1 for on-device), and number of runs for error bars, and we will provide ablations showing the individual and combined contributions of prompt rewriting for prefix caching and LLM-free speculative decoding. This will allow readers to better assess the generalizability of the 1.61x speedup with no accuracy loss.
  Revision: yes
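The measurement setup promised in this response (repeated runs, error bars, per-component ablations) could be summarized along the following lines. This is a hedged sketch only: the function names, the configuration labels, and every latency value in the usage example are hypothetical placeholders, not measurements from the paper. It reports the mean and sample standard deviation of paired per-run speedup ratios, which is one of several reasonable conventions.

```python
import statistics
from typing import Dict, List, Tuple


def speedup_summary(baseline_s: List[float],
                    optimized_s: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of paired per-run speedups.

    Run i of the baseline is paired with run i of the optimized system
    (same prompt, same device), so both lists must have equal length.
    """
    if len(baseline_s) != len(optimized_s) or len(baseline_s) < 2:
        raise ValueError("need at least two paired runs")
    ratios = [b / o for b, o in zip(baseline_s, optimized_s)]
    return statistics.mean(ratios), statistics.stdev(ratios)


def ablation_table(runs: Dict[str, List[float]],
                   baseline_key: str = "baseline") -> Dict[str, Tuple[float, float]]:
    """Speedup of every configuration relative to the baseline runs."""
    base = runs[baseline_key]
    return {name: speedup_summary(base, latencies)
            for name, latencies in runs.items() if name != baseline_key}


if __name__ == "__main__":
    # Placeholder latencies in seconds, invented purely to exercise the code.
    runs = {
        "baseline":          [2.0, 4.0, 3.0],
        "prompt_rewriting":  [1.6, 3.2, 2.4],
        "spec_decoding":     [1.8, 3.6, 2.7],
        "both":              [1.0, 2.0, 1.5],
    }
    for name, (mean, std) in ablation_table(runs).items():
        print(f"{name}: {mean:.2f}x +/- {std:.2f}")
```

Reporting the three non-baseline rows side by side is what lets a reader check whether the combined gain is roughly the composition of the individual gains or depends on an interaction between the two components.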
Circularity Check
No circularity: purely empirical systems evaluation with no derivations
The paper presents Agent-X as an empirical software framework for accelerating on-device LLM agents via prompt rewriting for prefix caching and LLM-free speculative decoding. All central claims (1.61x end-to-end speedup, no accuracy loss) are stated as direct measurements on real systems using representative workloads. No equations, parameter fittings, uniqueness theorems, or derivation chains appear in the abstract or described content. The work contains no self-referential definitions, fitted inputs renamed as predictions, or load-bearing self-citations that reduce claims to their own inputs. This is a standard empirical systems paper whose results stand or fall on experimental data rather than any closed logical loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Agent workloads exhibit reusable prefix token patterns amenable to caching
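The domain assumption can be probed directly by measuring how many leading tokens two consecutive agent prompts share, since a prefix cache can only reuse that shared span. The sketch below is an assumption-laden illustration, not the paper's rewriting algorithm: it uses whitespace tokenization, a made-up system prompt and tool descriptions, and a simple rewrite rule (put tool descriptions in a canonical sorted order) to show how reordering alone can lengthen the reusable prefix.

```python
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Whitespace tokenizer, a stand-in for a real model tokenizer."""
    return text.split()


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(system: str, tools: List[str], query: str,
                 canonical: bool) -> List[str]:
    """Assemble system text, tool descriptions, then the user query.

    With canonical=True the tool descriptions are sorted, so the same tools
    always appear in the same order regardless of retrieval order.
    """
    ordered = sorted(tools) if canonical else list(tools)
    tokens = tokenize(system)
    for tool in ordered:
        tokens += tokenize(tool)
    return tokens + tokenize(query)


# Hypothetical prompt pieces, invented for this example.
SYSTEM = "You are an on-device assistant"
CREATE_EVENT = "create_event : add a calendar entry"
OPEN_MAPS = "open_maps : show a location"
SEND_EMAIL = "send_email : send a message"

# Two consecutive requests whose tool lists arrive in retrieval order.
REQ1 = ([SEND_EMAIL, CREATE_EVENT], "email Bob about lunch")
REQ2 = ([OPEN_MAPS, SEND_EMAIL, CREATE_EVENT], "where is the cafe")


def reusable_tokens(canonical: bool) -> int:
    p1 = build_prompt(SYSTEM, REQ1[0], REQ1[1], canonical)
    p2 = build_prompt(SYSTEM, REQ2[0], REQ2[1], canonical)
    return shared_prefix_len(p1, p2)


if __name__ == "__main__":
    print("retrieval order:", reusable_tokens(canonical=False))
    print("canonical order:", reusable_tokens(canonical=True))
```

In this toy case the retrieval-order prompts share only the system text, while the sorted prompts also share the first tool description. Alphabetical sorting is just the simplest canonical order; an ordering informed by which tools tend to co-occur would likely do better, and whether real agent traffic behaves this way is exactly what the assumption asserts.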
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel [unclear]
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "PromptWeaver reconstructs the input prompt to enable efficient prefix caching... ExSpec introduces a lightweight, prompt-aware draft model that enables efficient speculative decoding at the edge."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction [unclear]
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We use 1,022 examples from the TinyAgent fine-tuning test dataset... tool co-activation locality... NMF clustering"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.