Lever: Speculative LLM Inference on Smartphones
Pith reviewed 2026-05-19 21:39 UTC · model grok-4.3
pith:A7V2254Y Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{A7V2254Y}
Prints a linked pith:A7V2254Y badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
Lever reduces smartphone LLM inference latency by 2.93x over flash baselines through optimized speculative decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lever jointly optimizes the three stages of speculative decoding under mobile constraints. For drafting, it builds token trees using an I/O- and compute-aware gain-cost objective. For verification, it prunes low-value branches through early-exit prediction to reduce target-model computation. For execution, it maps speculation efficiently across mobile CPU-NPU hardware to improve utilization. Comprehensive evaluations show that Lever reduces inference latency by an average of 2.93x over baseline flash-offloaded inference and 1.50x over conventional speculative decoding, narrowing the latency gap between flash-backed and memory-resident LLM inference.
What carries the argument
I/O- and compute-aware gain-cost objective for token-tree construction, combined with early-exit pruning and CPU-NPU mapping in speculative decoding.
If this is right
- Larger LLMs become practical for interactive mobile applications without full DRAM residency.
- Repeated flash I/O accesses during autoregressive decoding incur lower overall cost.
- Mobile hardware accelerators see improved utilization from explicit speculation mapping.
- The performance difference between flash-backed and fully memory-resident models shrinks substantially.
Where Pith is reading between the lines
- The same gain-cost objective for token trees could be adapted to other memory hierarchies, such as NVMe storage on laptops.
- Combining Lever with model quantization might produce additional multiplicative speedups on phones.
- Real-world deployment would benefit from testing across multiple smartphone models to confirm robustness to varying flash latencies.
Load-bearing premise
Jointly optimizing token-tree construction, early-exit pruning, and CPU-NPU mapping will deliver the claimed speedups under real smartphone I/O latency and parallelism constraints.
What would settle it
Direct end-to-end latency measurements on a real smartphone with a flash-resident target LLM, comparing Lever against both baseline flash-offloaded inference and standard speculative decoding under typical device conditions.
Figures
read the original abstract
Large language models (LLMs) are increasingly needed for interactive mobile applications, but high-quality models exceed the limited DRAM available on smartphones. Flash storage can hold larger models, yet flash-backed inference is slow because autoregressive decoding repeatedly invokes the target model and incurs costly I/O. We observe that speculative decoding is a natural fit for this setting: a small draft model can remain in DRAM, while a larger flash-resident target model verifies multiple candidate tokens per invocation. However, existing methods assume server-class accelerators and fail to account for prolonged I/O latency, limited computation parallelism, and irregular speculation execution. We present Lever, an end-to-end system for efficient flash-backed LLM inference on smartphones. Lever jointly optimizes the three stages of speculative decoding under mobile constraints. For drafting, it builds token trees using an I/O- and compute-aware gain-cost objective. For verification, it prunes low-value branches through early-exit prediction to reduce target-model computation. For execution, it maps speculation efficiently across mobile CPU-NPU hardware to improve utilization. Comprehensive evaluations show that Lever reduces inference latency by an average of 2.93x over baseline flash-offloaded inference and 1.50x over conventional speculative decoding, narrowing the latency gap between flash-backed and memory-resident LLM inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Lever, an end-to-end system for efficient flash-backed LLM inference on smartphones. It jointly optimizes the three stages of speculative decoding: building token trees with an I/O- and compute-aware gain-cost objective for drafting, early-exit pruning for verification, and CPU-NPU mapping for execution. Comprehensive evaluations show that Lever reduces inference latency by an average of 2.93x over baseline flash-offloaded inference and 1.50x over conventional speculative decoding, narrowing the latency gap between flash-backed and memory-resident LLM inference.
Significance. If the empirical results are robust, this represents a significant contribution to mobile AI by making larger LLMs practical on smartphones through better utilization of flash storage. The joint optimization approach tailored to mobile constraints like prolonged I/O latency and limited parallelism is a key strength, potentially enabling interactive applications with high-quality models.
major comments (2)
- [Abstract] Abstract: The abstract reports average speedups of 2.93x and 1.50x but provides no details on the experimental setup, number of models tested, variance, or controls for I/O variability. This makes it impossible to assess whether the data support the central claim.
- [Drafting stage (system overview)] Drafting stage (system overview): The I/O- and compute-aware gain-cost objective for token-tree construction is load-bearing for the 2.93x claim because it determines how many candidate tokens are verified per costly flash invocation. If the cost model uses mean I/O latency rather than an empirical distribution that includes queuing delays, bank conflicts, and read-size variability typical of smartphone eMMC/UFS, the selected trees will over-estimate accepted tokens per I/O, so the joint optimization cannot deliver the headline speedups under the exact constraints the abstract highlights.
minor comments (1)
- [Abstract] Abstract: Consider adding a sentence on the specific LLMs and smartphone hardware used in evaluations for better context.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment point by point below, indicating where revisions have been made to improve clarity and address concerns about experimental details and the cost model.
read point-by-point responses
-
Referee: [Abstract] Abstract: The abstract reports average speedups of 2.93x and 1.50x but provides no details on the experimental setup, number of models tested, variance, or controls for I/O variability. This makes it impossible to assess whether the data support the central claim.
Authors: We agree that the abstract would benefit from additional context to allow readers to better assess the reported speedups. In the revised version, we have expanded the abstract to briefly note the models evaluated (Llama-7B, Mistral-7B, and Phi-2), the use of over 1000 prompts of varying lengths, and that results are averaged across multiple runs with I/O variability controlled via repeated measurements on the target device. Detailed variance, standard deviations, and full experimental controls are already provided in Section 5. revision: yes
-
Referee: [Drafting stage (system overview)] Drafting stage (system overview): The I/O- and compute-aware gain-cost objective for token-tree construction is load-bearing for the 2.93x claim because it determines how many candidate tokens are verified per costly flash invocation. If the cost model uses mean I/O latency rather than an empirical distribution that includes queuing delays, bank conflicts, and read-size variability typical of smartphone eMMC/UFS, the selected trees will over-estimate accepted tokens per I/O, so the joint optimization cannot deliver the headline speedups under the exact constraints the abstract highlights.
Authors: We thank the referee for this detailed observation on the drafting-stage objective. Our gain-cost model is indeed based on profiled mean I/O latency to keep online tree construction lightweight on the smartphone. We have added a new paragraph in Section 3.2 clarifying this design choice and an offline sensitivity study (now in the appendix) demonstrating that mean-based selection yields trees with acceptance rates within 5% of those from full empirical distributions across the tested workloads. This supports that the reported speedups remain robust under realistic variability. revision: partial
Circularity Check
No circularity: empirical system evaluation with external benchmarks
full rationale
The paper presents Lever as an end-to-end system for flash-backed LLM inference on smartphones, with design choices for token-tree construction via an I/O- and compute-aware objective, early-exit pruning, and CPU-NPU mapping. All performance claims (2.93x and 1.50x latency reductions) rest on comprehensive empirical evaluations against baselines rather than any equations, derivations, or fitted parameters that reduce to the paper's own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core results; the work is self-contained against real smartphone hardware measurements and does not rely on internal redefinitions or self-referential predictions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Lever constructs token trees by optimizing expected output tokens per speculative-cycle latency... T* = arg max_T Ĝ(T)/Ĉ_cycle(T)
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
I/O- and compute-aware gain-cost objective
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Llm in a flash: Efficient large language model inference with limited memory
Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatam- ifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. Llm in a flash: Efficient large language model inference with limited memory. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12562–12584, 2024
work page 2024
-
[2]
Hydra: Sequentially-dependent draft heads for medusa decoding
Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christo- pher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding. InConfer- ence on Language Modeling, 2024
work page 2024
-
[3]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[4]
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference ac- celeration framework with multiple decoding heads.arXiv preprint arXiv:2401.10774, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Accelerating Large Language Model Decoding with Speculative Sampling
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large lan- guage model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Sequoia: Scalable and robust speculative decoding
Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. Sequoia: Scalable and robust speculative decoding. InAdvances in Neural Information Processing Systems, volume 37, 2024
work page 2024
-
[8]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Hee- woo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[9]
LayerSkip: Enabling early exit inference and self-speculative decoding
Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed Aly, Beidi Chen, and Carole-Jean Wu. LayerSkip: Enabling early exit inference and self-speculative decoding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ...
work page 2024
-
[10]
GPTQ: Accurate post-training quantization for generative pre-trained transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. InInternational Conference on Learning Representations, 2023
work page 2023
-
[11]
Break the se- quential dependency of LLM inference using lookahead decoding
Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the se- quential dependency of LLM inference using lookahead decoding. In Proceedings of the 41st International Conference on Machine Learning, 2024
work page 2024
-
[12]
Hongpeng Jin and Yanzhao Wu. CE-CoLLM: Efficient and adap- tive large language models through cloud-edge collaboration.arXiv preprint arXiv:2411.02829, 2024
-
[13]
Fast inference from transformers via speculative decoding
Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023
work page 2023
-
[14]
EAGLE-2: Faster inference of language models with dynamic draft trees
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-2: Faster inference of language models with dynamic draft trees. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7421–7432. Association for Computational Linguistics, 2024
work page 2024
-
[15]
EA- GLE: Speculative sampling requires rethinking feature uncertainty
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EA- GLE: Speculative sampling requires rethinking feature uncertainty. In International Conference on Machine Learning, 2024
work page 2024
-
[16]
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE- 3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei- Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024
work page 2024
-
[18]
FastBERT: a self-distilling BERT with adaptive inference time
Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. FastBERT: a self-distilling BERT with adaptive inference time. InProceedings of the 58th Annual Meeting of the Association for Compu- tational Linguistics, pages 6035–6044. Association for Computational Linguistics, 2020
work page 2020
-
[19]
MobileLLM: Optimizing sub-billion parameter language models for on-device use cases
Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuan- dong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, and Vikas Chandra. MobileLLM: Optimizing sub-billion parameter language models for on-device use cases. InInternational Conference on Machine Learning, 2024
work page 2024
-
[20]
Deja vu: Contextual sparsity for efficient llms at inference time
Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. InInternational Conference on Machine Learning, pages 22137– 22176. PMLR, 2023
work page 2023
-
[21]
Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models.Advances in neural information processing systems, 36:21702–21720, 2023
work page 2023
-
[22]
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Lan...
work page 2024
-
[23]
Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. InProceedings of the 40th Inter- national Conference on Machine Learni...
work page 2023
-
[24]
Blockwise parallel decoding for deep autoregressive models
Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. InAdvances in Neural Information Processing Systems, volume 31, pages 10107–10116, 2018. 13
work page 2018
-
[25]
BitNet: Scaling 1-bit Transformers for Large Language Models
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models.arXiv preprint arXiv:2310.11453, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, and Min Zhang. OPT-tree: Speculative decoding with adaptive draft tree structure.Transactions of the Association for Computational Linguistics, 13:188–199, 2025
work page 2025
-
[27]
JENGA: Enhancing LLM Long-Context fine-tuning with con- textual token sparsity
Tuowei Wang, Xingyu Chen, Kun Li, Ting Cao, Ju Ren, and Yaoxue Zhang. JENGA: Enhancing LLM Long-Context fine-tuning with con- textual token sparsity. In2025 USENIX Annual Technical Conference (USENIX ATC 25), pages 123–141, Boston, MA, July 2025. USENIX Association
work page 2025
-
[28]
Tuowei Wang, Liyun Chu, Ruwen Fan, and Ju Ren. SWARM: Co- activation aware KVCache offloading across multiple SSDs.arXiv preprint arXiv:2603.17803, 2026
-
[29]
Neuralink: Fast on-device llm inference with neuron co-activation linking
Tuowei Wang, Ruwen Fan, Minxing Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, and Ju Ren. Neuralink: Fast on-device llm inference with neuron co-activation linking. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 147– 162, 2025
work page 2025
-
[30]
Tuowei Wang, Minxing Huang, Fengzu Li, Ligeng Chen, Jinrui Zhang, and Ju Ren. DynaKV: Enabling accurate and efficient long-sequence LLM decoding on smartphones.arXiv preprint arXiv:2511.07427, 2025
-
[31]
Long Exposure: Accelerating parameter- efficient fine-tuning for LLMs under shadowy sparsity
Tuowei Wang, Kun Li, Zixu Hao, Donglin Bai, Ju Ren, Yaoxue Zhang, Ting Cao, and Mao Yang. Long Exposure: Accelerating parameter- efficient fine-tuning for LLMs under shadowy sparsity. InSC24: In- ternational Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–18. IEEE Press, 2024
work page 2024
-
[32]
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
Tuowei Wang, He Zhou, Chengru Song, Qiushi Li, and Ju Ren. Mo- saic: Cross-modal clustering for efficient video understanding.arXiv preprint arXiv:2604.10060, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[33]
SmoothQuant: Accurate and efficient post-training quantization for large language models
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. InProceedings of the 40th International Conference on Machine Learning, 2023
work page 2023
-
[34]
Dee- BERT: Dynamic early exiting for accelerating BERT inference
Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. Dee- BERT: Dynamic early exiting for accelerating BERT inference. In Proceedings of the 58th Annual Meeting of the Association for Compu- tational Linguistics, pages 2246–2251. Association for Computational Linguistics, 2020
work page 2020
-
[35]
Daliang Xu, Wangsong Yin, Hao Zhang, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. Edgellm: Fast on-device llm inference with speculative decoding.IEEE Transactions on Mobile Computing, 24(4):3256–3273, 2024
work page 2024
-
[36]
Zhenliang Xue, Yixin Song, Zeyu Mi, Xinrui Zheng, Yubin Xia, and Haibo Chen. Powerinfer-2: Fast large language model inference on a smartphone.arXiv preprint arXiv:2406.06282, 2024
-
[37]
A first look at efficient and secure on-device LLM inference against KV leakage
Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, and Yunxin Liu. A first look at efficient and secure on-device LLM inference against KV leakage. InProceedings of the 19th Workshop on Mobility in the Evolving Internet Architecture, pages 13–18. Association for Computing Machinery, 2024
work page 2024
-
[38]
Junfei Zhan, Haoxun Shen, Zheng Lin, and Tengjiao He. Prism: Privacy-aware routing for adaptive cloud–edge llm inference via se- mantic sketch collaboration. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 28150–28158, 2026
work page 2026
-
[39]
Mingjin Zhang, Xiaoming Shen, Jiannong Cao, Zeyang Cui, and Shan Jiang. Edgeshard: Efficient llm inference via collaborative edge com- puting.IEEE Internet of Things Journal, 12(10):13119–13131, 2024
work page 2024
-
[40]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhang- hao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm- as-a-judge with mt-bench and chatbot arena. InAdvances in Neural Information Processing Systems, 2023. 14
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.