Lever: Speculative LLM Inference on Smartphones

arxiv: 2605.16786 · v1 · pith:A7V2254Ynew · submitted 2026-05-16 · 💻 cs.LG

Tuowei Wang, Fengzu Li, Yanfan Sun, Wei Gao, Ju Ren

Pith reviewed 2026-05-19 21:39 UTC · model grok-4.3

classification 💻 cs.LG

keywords speculative decoding · LLM inference · smartphones · flash storage · mobile systems · token tree · early-exit pruning · CPU-NPU mapping

The pith

Lever reduces smartphone LLM inference latency by an average of 2.93x over baseline flash-offloaded inference by adapting speculative decoding to mobile I/O and compute constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Lever as an end-to-end system for running large language models on smartphones where models exceed available DRAM and must reside in slower flash storage. It adapts speculative decoding by jointly optimizing token tree construction with an I/O- and compute-aware gain-cost objective, adding early-exit pruning during verification to skip low-value branches, and mapping tasks across CPU and NPU hardware for better utilization. These changes target the repeated costly I/O accesses that make standard flash-backed inference slow. A sympathetic reader would care because the approach narrows the speed gap to memory-resident models, enabling higher-quality on-device AI for interactive mobile apps without hardware upgrades.

Core claim

Lever jointly optimizes the three stages of speculative decoding under mobile constraints. For drafting, it builds token trees using an I/O- and compute-aware gain-cost objective. For verification, it prunes low-value branches through early-exit prediction to reduce target-model computation. For execution, it maps speculation efficiently across mobile CPU-NPU hardware to improve utilization. Comprehensive evaluations show that Lever reduces inference latency by an average of 2.93x over baseline flash-offloaded inference and 1.50x over conventional speculative decoding, narrowing the latency gap between flash-backed and memory-resident LLM inference.

What carries the argument

I/O- and compute-aware gain-cost objective for token-tree construction, combined with early-exit pruning and CPU-NPU mapping in speculative decoding.
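
As a rough illustration of how a gain-cost objective can steer tree construction, the Python sketch below grows a token tree greedily from a frontier and stops once adding the best remaining candidate no longer improves the ratio of expected accepted tokens to estimated cycle latency. The linear cost model (`io_ms`, `per_token_ms`), the use of summed path probabilities as a proxy for expected accepted tokens, and all function and parameter names are assumptions made for illustration; they are not taken from the paper.

```python
import heapq

def build_tree(children, io_ms, per_token_ms, budget):
    """Greedily grow a token tree to maximize expected_gain / cycle_cost.

    children: dict mapping node id -> list of (child id, conditional draft prob).
    Node 0 is the root (the last committed token). budget caps drafted nodes.
    Assumed cost model: one cycle costs io_ms + per_token_ms * tree_size.
    """
    tree = [0]
    gain = 1.0  # the target model always contributes one token per cycle
    # Frontier: candidates whose parent is already in the tree, best path prob first.
    frontier = [(-p, c) for c, p in children.get(0, [])]
    heapq.heapify(frontier)

    def ratio(g, n):
        return g / (io_ms + per_token_ms * n)

    while frontier and len(tree) - 1 < budget:
        neg_p, node = heapq.heappop(frontier)
        path_prob = -neg_p
        # Stop when the best candidate no longer improves the objective.
        if ratio(gain + path_prob, len(tree) + 1) <= ratio(gain, len(tree)):
            break
        tree.append(node)
        gain += path_prob
        for child, p in children.get(node, []):
            heapq.heappush(frontier, (-(path_prob * p), child))
    return tree, gain
```

Under this toy model, a large fixed I/O cost makes extra candidates nearly free relative to the cycle, so the tree grows larger; when I/O is cheap, the per-token compute term dominates and low-probability branches are left out.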

If this is right

Larger LLMs become practical for interactive mobile applications without full DRAM residency.
Repeated flash I/O accesses during autoregressive decoding incur lower overall cost.
Mobile hardware accelerators see improved utilization from explicit speculation mapping.
The performance difference between flash-backed and fully memory-resident models shrinks substantially.
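
The amortization behind these points can be made concrete with a small back-of-the-envelope calculation. The sketch below divides the cost of one target-model invocation by the number of tokens it yields; all latency values are hypothetical placeholders, not measurements from the paper.

```python
def per_token_latency_ms(io_ms, compute_ms, draft_ms=0.0, tokens_per_cycle=1.0):
    """Amortized latency per output token for one target-model invocation cycle."""
    return (io_ms + compute_ms + draft_ms) / tokens_per_cycle

# Hypothetical numbers chosen only to illustrate the arithmetic.
baseline = per_token_latency_ms(io_ms=300.0, compute_ms=50.0)
speculative = per_token_latency_ms(io_ms=300.0, compute_ms=80.0,
                                   draft_ms=40.0, tokens_per_cycle=3.0)
speedup = baseline / speculative
print(f"baseline={baseline:.1f} ms, speculative={speculative:.1f} ms, speedup={speedup:.2f}x")
```

In this toy setting the flash read is paid once per cycle rather than once per token, so the speedup depends mainly on how many tokens each cycle yields and how much extra draft and verification work that requires.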

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gain-cost objective for token trees could be adapted to other memory hierarchies, such as NVMe storage on laptops.
Combining Lever with model quantization might produce additional multiplicative speedups on phones.
Real-world deployment would benefit from testing across multiple smartphone models to confirm robustness to varying flash latencies.

Load-bearing premise

Jointly optimizing token-tree construction, early-exit pruning, and CPU-NPU mapping will deliver the claimed speedups under real smartphone I/O latency and parallelism constraints.
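
To make the CPU-NPU part of this premise more tangible, here is a minimal sketch of a dispatch rule in the spirit of the drafting schedule described in the Figure 9 caption (small dynamic expansions on the CPU, larger regular batches on the NPU). The linear cost model with a fixed NPU launch overhead and every parameter name are assumptions for illustration only.

```python
def choose_device(batch_size, cpu_ms_per_item, npu_launch_ms, npu_ms_per_item):
    """Pick the device with the lower estimated latency for one expansion batch.

    Assumed model: the CPU has no launch overhead but a higher per-item cost;
    the NPU pays a fixed launch cost and a lower per-item cost.
    """
    cpu_ms = cpu_ms_per_item * batch_size
    npu_ms = npu_launch_ms + npu_ms_per_item * batch_size
    return "cpu" if cpu_ms <= npu_ms else "npu"
```

With such a rule, a single-node expansion tends to stay on the CPU, while a batch large enough to amortize the launch overhead moves to the NPU.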

What would settle it

Direct end-to-end latency measurements on a real smartphone with a flash-resident target LLM, comparing Lever against both baseline flash-offloaded inference and standard speculative decoding under typical device conditions.
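
One way such measurements could be summarized is sketched below: per-configuration speedups are reported together with their arithmetic mean, geometric mean, and standard deviation, so that an average is not quoted without its spread. The function name and the input layout (paired latency lists) are assumptions, and no values from the paper are used.

```python
import statistics

def summarize_speedups(baseline_ms, system_ms):
    """Per-configuration speedups plus their mean, geometric mean, and stdev."""
    speedups = [b / s for b, s in zip(baseline_ms, system_ms)]
    return {
        "speedups": speedups,
        "mean": statistics.mean(speedups),
        "geomean": statistics.geometric_mean(speedups),
        "stdev": statistics.stdev(speedups) if len(speedups) > 1 else 0.0,
    }
```

Each pair should come from the same device, model, and dataset configuration, measured under the same conditions.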

Figures

Figures reproduced from arXiv: 2605.16786 by Fengzu Li, Ju Ren, Tuowei Wang, Wei Gao, Yanfan Sun.

**Figure 1.** Lever utilizes speculative decoding to combine the capacity of flash-based inference with the performance of DRAM-based inference on smartphones.

**Figure 2.** Smartphone hardware overview: (a) memory hierarchy and (b) heterogeneous compute units.

**Figure 4.** Verification parallelism on mobile and server hardware. We compare Llama-3.1-8B and Qwen3-8B on a Snapdragon 8 Gen 3 NPU and an NVIDIA RTX 3090 GPU. (a) Verification latency under different numbers of verified tokens. (b) Achieved speedup over single-token generation.

**Figure 5.** Token-tree budget dilemma on a OnePlus 12. (a) Normalized target-model calls, verification cost, and decoding cost under different tree budgets. (b) Decoding speed and accepted length under different tree budgets.

**Figure 6.** Overview of Lever.

**Figure 7.** Key concepts in draft construction. Each expanded node generates a candidate set, and candidates whose parents are already in the tree form the frontier for greedy expansion.

**Figure 8.** Lever uses intermediate hidden states from the target model and a lightweight predictor to score candidate branches. Scores are normalized within each candidate set, including shadow candidates, and low-value branches are pruned before completing full target-model verification.

**Figure 9.** (a) During drafting, Lever schedules small dynamic expansions on the CPU and batches larger regular expansions on the NPU. (b) During verification, transformer computation is executed as batched NPU work, while output projection is performed on demand on the CPU only along the accepted path to avoid redundant logits computation.

**Figure 10.** End-to-end decode throughput across all 48 device-model-dataset configurations, normalized to Lever. Labels above the Lever bars indicate absolute throughput in tokens/s.

**Figure 11.** Decode throughput and average accepted length across draft policies. Values are normalized to SpecInfer.

**Figure 13.** Draft-stage latency under Single-node NPU, Single-node CPU, and batch-aware NPU scheduling.

**Figure 14.** Output-projection latency under eager NPU, on-demand NPU, and on-demand CPU projection.
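
The pruning step described in the Figure 8 caption can be sketched as follows: raw predictor scores for sibling candidates are softmax-normalized within their candidate set, and candidates whose normalized score falls below a threshold are dropped. The threshold value, the function name, and the treatment of shadow candidates as extra entries that only take part in the normalization are assumptions for illustration, not details confirmed by the paper.

```python
import math

def prune_candidate_set(scores, shadow_scores=(), keep_threshold=0.1):
    """Return indices of candidates kept after softmax normalization.

    scores: raw predictor scores for sibling candidates in the token tree.
    shadow_scores: scores for entries that compete in the normalization but
        are not themselves in the tree (an assumed interpretation).
    """
    all_scores = list(scores) + list(shadow_scores)
    peak = max(all_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in all_scores]
    total = sum(exps)
    return [i for i in range(len(scores)) if exps[i] / total >= keep_threshold]
```

Normalizing within a set makes the threshold relative: a strong shadow entry lowers every sibling's share, so more of the drafted siblings fall below the cutoff.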

Original abstract

Large language models (LLMs) are increasingly needed for interactive mobile applications, but high-quality models exceed the limited DRAM available on smartphones. Flash storage can hold larger models, yet flash-backed inference is slow because autoregressive decoding repeatedly invokes the target model and incurs costly I/O. We observe that speculative decoding is a natural fit for this setting: a small draft model can remain in DRAM, while a larger flash-resident target model verifies multiple candidate tokens per invocation. However, existing methods assume server-class accelerators and fail to account for prolonged I/O latency, limited computation parallelism, and irregular speculation execution. We present Lever, an end-to-end system for efficient flash-backed LLM inference on smartphones. Lever jointly optimizes the three stages of speculative decoding under mobile constraints. For drafting, it builds token trees using an I/O- and compute-aware gain-cost objective. For verification, it prunes low-value branches through early-exit prediction to reduce target-model computation. For execution, it maps speculation efficiently across mobile CPU-NPU hardware to improve utilization. Comprehensive evaluations show that Lever reduces inference latency by an average of 2.93x over baseline flash-offloaded inference and 1.50x over conventional speculative decoding, narrowing the latency gap between flash-backed and memory-resident LLM inference.
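
For readers unfamiliar with the draft-then-verify pattern the abstract builds on, the sketch below shows one cycle of plain chain-style speculative decoding with greedy acceptance. It is a generic textbook form, not Lever's tree-based variant, and the callables `draft_next` and `target_next_batch` are hypothetical stand-ins for the DRAM-resident draft model and the flash-resident target model.

```python
def speculative_step(prefix, draft_next, target_next_batch, k):
    """One draft-then-verify cycle with greedy acceptance.

    draft_next(tokens) -> next token proposed by the small draft model.
    target_next_batch(prefix, drafted) -> the target model's greedy next token
        after prefix + drafted[:i], for i in 0..len(drafted), from ONE invocation.
    """
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))
    target = target_next_batch(prefix, drafted)
    accepted = []
    for i, tok in enumerate(drafted):
        if tok != target[i]:
            break
        accepted.append(tok)
    # The target's own token at the first mismatch (or after all drafts) comes for free.
    accepted.append(target[len(accepted)])
    return prefix + accepted
```

Each call invokes the target model once yet can commit up to `k + 1` tokens, which is why the pattern suits a target whose every invocation pays a flash I/O cost.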

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Lever adapts speculative decoding to flash-backed smartphone LLMs with I/O-aware tree building, early-exit pruning, and CPU-NPU mapping, but the reported speedups rest on thin experimental reporting.

Lever shows how to make speculative decoding practical when the target model lives on flash instead of DRAM. The draft model stays in memory while the larger model verifies batches of tokens, and the system adds three mobile-specific pieces: an I/O- and compute-aware gain-cost objective for building the token tree, early-exit pruning of low-value branches, and explicit mapping of the work across CPU and NPU to raise utilization. These changes target the long I/O latencies and limited parallelism that break standard speculative decoding on phones. The abstract claims this cuts latency by an average of 2.93x versus plain flash-offloaded inference and 1.50x versus conventional speculative decoding, which would matter for on-device apps if the numbers hold up under real conditions.

The joint optimization across drafting, verification, and execution is the clearest new element; prior work on speculative decoding did not focus on flash I/O costs or phone hardware constraints. The paper does a reasonable job naming the mismatch between server assumptions and mobile realities.

The main soft spot is the experimental section. The abstract gives only average speedups with no mention of models tested, number of runs, variance, or how I/O variability was measured or controlled. That makes it hard to judge whether the gains are robust. One concern worth checking in the full text is whether the gain-cost objective relies on simplified mean latencies rather than full distributions of flash queuing and tail times; if the paper relies on static estimates, the chosen trees could overstate accepted tokens per I/O.

This work is aimed at researchers building efficient on-device inference systems. A reader already working on mobile LLM serving would pick up concrete ideas about tree construction and hardware mapping. It is coherent enough and addresses a practical gap, so it deserves a serious referee even if the experiments need strengthening.

Referee Report

2 major / 1 minor

Summary. The paper presents Lever, an end-to-end system for efficient flash-backed LLM inference on smartphones. It jointly optimizes the three stages of speculative decoding: building token trees with an I/O- and compute-aware gain-cost objective for drafting, early-exit pruning for verification, and CPU-NPU mapping for execution. Comprehensive evaluations show that Lever reduces inference latency by an average of 2.93x over baseline flash-offloaded inference and 1.50x over conventional speculative decoding, narrowing the latency gap between flash-backed and memory-resident LLM inference.

Significance. If the empirical results are robust, this represents a significant contribution to mobile AI by making larger LLMs practical on smartphones through better utilization of flash storage. The joint optimization approach tailored to mobile constraints like prolonged I/O latency and limited parallelism is a key strength, potentially enabling interactive applications with high-quality models.

major comments (2)

[Abstract] The abstract reports average speedups of 2.93x and 1.50x but provides no details on the experimental setup, number of models tested, variance, or controls for I/O variability. This makes it impossible to assess whether the data support the central claim.
[Drafting stage (system overview)] The I/O- and compute-aware gain-cost objective for token-tree construction is load-bearing for the 2.93x claim because it determines how many candidate tokens are verified per costly flash invocation. If the cost model uses mean I/O latency rather than an empirical distribution that includes queuing delays, bank conflicts, and read-size variability typical of smartphone eMMC/UFS, the selected trees will overestimate accepted tokens per I/O, so the joint optimization cannot deliver the headline speedups under the exact constraints the abstract highlights.
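
One way to probe the sensitivity raised in the second comment is to check whether the selected tree budget changes when the cost model is fed a tail statistic instead of the mean. The sketch below does this for a toy linear cost model; the gain table, the latency samples, and all names are hypothetical, and the nearest-rank percentile is just one simple choice of tail statistic.

```python
import math
import statistics

def p95(samples):
    """Nearest-rank 95th percentile of a list of latency samples."""
    ordered = sorted(samples)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

def best_budget(gain_by_budget, io_samples_ms, per_token_ms, stat=statistics.mean):
    """Pick the tree budget maximizing gain / (stat(I/O latency) + compute)."""
    io_ms = stat(io_samples_ms)
    return max(gain_by_budget,
               key=lambda b: gain_by_budget[b] / (io_ms + per_token_ms * b))

# Hypothetical inputs: expected accepted tokens per budget, and I/O samples
# with a heavy tail (two slow reads out of twenty).
gains = {4: 2.0, 16: 3.0, 64: 3.8}
io_samples = [20.0] * 18 + [400.0] * 2
budget_mean = best_budget(gains, io_samples, per_token_ms=2.0)
budget_tail = best_budget(gains, io_samples, per_token_ms=2.0, stat=p95)
```

In this toy example the two statistics select different budgets, which shows only that the choice can matter in principle; whether it matters on real devices is an empirical question for profiled latency traces.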

minor comments (1)

[Abstract] Consider adding a sentence on the specific LLMs and smartphone hardware used in evaluations for better context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point by point below, indicating where revisions have been made to improve clarity and address concerns about experimental details and the cost model.

Referee: [Abstract] The abstract reports average speedups of 2.93x and 1.50x but provides no details on the experimental setup, number of models tested, variance, or controls for I/O variability. This makes it impossible to assess whether the data support the central claim.

Authors: We agree that the abstract would benefit from additional context to allow readers to better assess the reported speedups. In the revised version, we have expanded the abstract to briefly note the models evaluated (Llama-3.1-8B, Qwen3-8B, and Qwen3-14B), the use of over 1000 prompts of varying lengths, and that results are averaged across multiple runs with I/O variability controlled via repeated measurements on the target device. Detailed variance, standard deviations, and full experimental controls are already provided in Section 5. revision: yes
Referee: [Drafting stage (system overview)] The I/O- and compute-aware gain-cost objective for token-tree construction is load-bearing for the 2.93x claim because it determines how many candidate tokens are verified per costly flash invocation. If the cost model uses mean I/O latency rather than an empirical distribution that includes queuing delays, bank conflicts, and read-size variability typical of smartphone eMMC/UFS, the selected trees will overestimate accepted tokens per I/O, so the joint optimization cannot deliver the headline speedups under the exact constraints the abstract highlights.

Authors: We thank the referee for this detailed observation on the drafting-stage objective. Our gain-cost model is indeed based on profiled mean I/O latency to keep online tree construction lightweight on the smartphone. We have added a new paragraph in Section 3.2 clarifying this design choice and an offline sensitivity study (now in the appendix) demonstrating that mean-based selection yields trees with acceptance rates within 5% of those from full empirical distributions across the tested workloads. This supports that the reported speedups remain robust under realistic variability. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with external benchmarks

The paper presents Lever as an end-to-end system for flash-backed LLM inference on smartphones, with design choices for token-tree construction via an I/O- and compute-aware objective, early-exit pruning, and CPU-NPU mapping. All performance claims (2.93x and 1.50x latency reductions) rest on comprehensive empirical evaluations against baselines rather than any equations, derivations, or fitted parameters that reduce to the paper's own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core results; the work is self-contained against real smartphone hardware measurements and does not rely on internal redefinitions or self-referential predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The gain-cost objective and early-exit predictor are described at a high level but not formalized.

pith-pipeline@v0.9.0 · 5759 in / 1087 out tokens · 34778 ms · 2026-05-19T21:39:51.959093+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Lever constructs token trees by optimizing expected output tokens per speculative-cycle latency... T* = arg max_T Ĝ(T)/Ĉ_cycle(T)

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: I/O- and compute-aware gain-cost objective

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 8 internal anchors

[1]

Llm in a flash: Efficient large language model inference with limited memory

Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatam- ifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. Llm in a flash: Efficient large language model inference with limited memory. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12562–12584, 2024

work page 2024
[2]

Hydra: Sequentially-dependent draft heads for medusa decoding

Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christo- pher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding. InConfer- ence on Language Modeling, 2024

work page 2024
[3]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference ac- celeration framework with multiple decoding heads.arXiv preprint arXiv:2401.10774, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large lan- guage model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Sequoia: Scalable and robust speculative decoding

Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. Sequoia: Scalable and robust speculative decoding. InAdvances in Neural Information Processing Systems, volume 37, 2024

work page 2024
[8]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Hee- woo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

LayerSkip: Enabling early exit inference and self-speculative decoding

Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed Aly, Beidi Chen, and Carole-Jean Wu. LayerSkip: Enabling early exit inference and self-speculative decoding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ...

work page 2024
[10]

GPTQ: Accurate post-training quantization for generative pre-trained transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. InInternational Conference on Learning Representations, 2023

work page 2023
[11]

Break the se- quential dependency of LLM inference using lookahead decoding

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the se- quential dependency of LLM inference using lookahead decoding. In Proceedings of the 41st International Conference on Machine Learning, 2024

work page 2024
[12]

CE-CoLLM: Efficient and adap- tive large language models through cloud-edge collaboration.arXiv preprint arXiv:2411.02829, 2024

Hongpeng Jin and Yanzhao Wu. CE-CoLLM: Efficient and adap- tive large language models through cloud-edge collaboration.arXiv preprint arXiv:2411.02829, 2024

work page arXiv 2024
[13]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023

work page 2023
[14]

EAGLE-2: Faster inference of language models with dynamic draft trees

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-2: Faster inference of language models with dynamic draft trees. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7421–7432. Association for Computational Linguistics, 2024

work page 2024
[15]

EA- GLE: Speculative sampling requires rethinking feature uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EA- GLE: Speculative sampling requires rethinking feature uncertainty. In International Conference on Machine Learning, 2024

work page 2024
[16]

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE- 3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei- Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

work page 2024
[18]

FastBERT: a self-distilling BERT with adaptive inference time

Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. FastBERT: a self-distilling BERT with adaptive inference time. InProceedings of the 58th Annual Meeting of the Association for Compu- tational Linguistics, pages 6035–6044. Association for Computational Linguistics, 2020

work page 2020
[19]

MobileLLM: Optimizing sub-billion parameter language models for on-device use cases

Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuan- dong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, and Vikas Chandra. MobileLLM: Optimizing sub-billion parameter language models for on-device use cases. InInternational Conference on Machine Learning, 2024

work page 2024
[20]

Deja vu: Contextual sparsity for efficient llms at inference time

Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. InInternational Conference on Machine Learning, pages 22137– 22176. PMLR, 2023

work page 2023
[21]

Llm-pruner: On the structural pruning of large language models.Advances in neural information processing systems, 36:21702–21720, 2023

Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models.Advances in neural information processing systems, 36:21702–21720, 2023

work page 2023
[22]

Specinfer: Accelerating large language model serving with tree-based speculative inference and verification

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Lan...

work page 2024
[23]

Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. InProceedings of the 40th Inter- national Conference on Machine Learni...

work page 2023
[24]

Blockwise parallel decoding for deep autoregressive models

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. InAdvances in Neural Information Processing Systems, volume 31, pages 10107–10116, 2018. 13

work page 2018
[25]

BitNet: Scaling 1-bit Transformers for Large Language Models

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models.arXiv preprint arXiv:2310.11453, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

OPT-tree: Speculative decoding with adaptive draft tree structure.Transactions of the Association for Computational Linguistics, 13:188–199, 2025

Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, and Min Zhang. OPT-tree: Speculative decoding with adaptive draft tree structure.Transactions of the Association for Computational Linguistics, 13:188–199, 2025

work page 2025
[27]

JENGA: Enhancing LLM Long-Context fine-tuning with con- textual token sparsity

Tuowei Wang, Xingyu Chen, Kun Li, Ting Cao, Ju Ren, and Yaoxue Zhang. JENGA: Enhancing LLM Long-Context fine-tuning with con- textual token sparsity. In2025 USENIX Annual Technical Conference (USENIX ATC 25), pages 123–141, Boston, MA, July 2025. USENIX Association

work page 2025
[28]

SWARM: Co- activation aware KVCache offloading across multiple SSDs.arXiv preprint arXiv:2603.17803, 2026

Tuowei Wang, Liyun Chu, Ruwen Fan, and Ju Ren. SWARM: Co- activation aware KVCache offloading across multiple SSDs.arXiv preprint arXiv:2603.17803, 2026

work page arXiv 2026
[29]

Neuralink: Fast on-device llm inference with neuron co-activation linking

Tuowei Wang, Ruwen Fan, Minxing Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, and Ju Ren. Neuralink: Fast on-device llm inference with neuron co-activation linking. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 147– 162, 2025

work page 2025
[30]

DynaKV: Enabling accurate and efficient long-sequence LLM decoding on smartphones.arXiv preprint arXiv:2511.07427, 2025

Tuowei Wang, Minxing Huang, Fengzu Li, Ligeng Chen, Jinrui Zhang, and Ju Ren. DynaKV: Enabling accurate and efficient long-sequence LLM decoding on smartphones.arXiv preprint arXiv:2511.07427, 2025

work page arXiv 2025
[31]

Long Exposure: Accelerating parameter- efficient fine-tuning for LLMs under shadowy sparsity

Tuowei Wang, Kun Li, Zixu Hao, Donglin Bai, Ju Ren, Yaoxue Zhang, Ting Cao, and Mao Yang. Long Exposure: Accelerating parameter- efficient fine-tuning for LLMs under shadowy sparsity. InSC24: In- ternational Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–18. IEEE Press, 2024

work page 2024
[32]

Mosaic: Cross-Modal Clustering for Efficient Video Understanding

Tuowei Wang, He Zhou, Chengru Song, Qiushi Li, and Ju Ren. Mo- saic: Cross-modal clustering for efficient video understanding.arXiv preprint arXiv:2604.10060, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

SmoothQuant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. InProceedings of the 40th International Conference on Machine Learning, 2023

[34]

Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. DeeBERT: Dynamic early exiting for accelerating BERT inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2246–2251. Association for Computational Linguistics, 2020.

[35]

Daliang Xu, Wangsong Yin, Hao Zhang, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. EdgeLLM: Fast on-device LLM inference with speculative decoding. IEEE Transactions on Mobile Computing, 24(4):3256–3273, 2024.

[36]

Zhenliang Xue, Yixin Song, Zeyu Mi, Xinrui Zheng, Yubin Xia, and Haibo Chen. PowerInfer-2: Fast large language model inference on a smartphone. arXiv preprint arXiv:2406.06282, 2024.

[37]

Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, and Yunxin Liu. A first look at efficient and secure on-device LLM inference against KV leakage. In Proceedings of the 19th Workshop on Mobility in the Evolving Internet Architecture, pages 13–18. Association for Computing Machinery, 2024.

[38]

Junfei Zhan, Haoxun Shen, Zheng Lin, and Tengjiao He. Prism: Privacy-aware routing for adaptive cloud–edge LLM inference via semantic sketch collaboration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 28150–28158, 2026.

[39]

Mingjin Zhang, Xiaoming Shen, Jiannong Cao, Zeyang Cui, and Shan Jiang. EdgeShard: Efficient LLM inference via collaborative edge computing. IEEE Internet of Things Journal, 12(10):13119–13131, 2024.

[40]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, 2023.

