arXiv: 2412.03594 · v3 · submitted 2024-11-29 · 💻 cs.CL · cs.AI · cs.DC · cs.LG

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Pith reviewed 2026-05-23 16:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.DC · cs.LG
keywords LLM inference · batched processing · prefix sharing · KV cache · token batching · throughput optimization · GPU utilization

The pith

BatchLLM achieves 1.3× to 10.8× higher throughput for large batched LLM inference by identifying shared prefixes globally and reordering token batches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that existing LLM inference engines, built around streaming requests and LRU-based KV caching, leave substantial throughput on the table for large batched or offline workloads that exhibit prefix sharing. BatchLLM instead performs explicit global identification of common prefixes, groups and reorders requests accordingly, and applies memory-centric token batching to enlarge effective batch sizes. A sympathetic reader would care because many industry tasks run in large batches where throughput, not latency, is the primary metric and where prompt prefixes often overlap. If the approach holds, it directly raises the number of tokens processed per unit time on the same hardware.

Core claim

BatchLLM explicitly identifies common prefixes globally so requests that share the same prefix are scheduled together to reuse KV context without premature eviction. It reorders requests to place those with a larger ratio of decoding tokens first, mixes decoding with later prefill chunks, and uses memory-centric token batching to increase token-batch sizes and GPU utilization. On this basis the system delivers 1.3× to 10.8× higher throughput than vLLM and SGLang across microbenchmarks and a representative industry workload on varied hardware.
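As a rough illustration of what global prefix identification could look like, the sketch below buckets tokenized requests by their leading tokens and then measures the prefix each bucket actually shares. The abstract does not specify the algorithm, so the function names `common_prefix_len` and `group_by_shared_prefix`, the `min_prefix_len` threshold, and the bucketing strategy itself are assumptions made for illustration, not BatchLLM's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def group_by_shared_prefix(
    requests: Sequence[Sequence[int]], min_prefix_len: int
) -> List[Tuple[int, List[int]]]:
    """Return (shared_prefix_length, request_indices) for each group.

    Hypothetical heuristic: requests whose first `min_prefix_len` tokens
    match land in the same group. Singleton groups report a shared
    length of 0 because nothing can be reused.
    """
    buckets: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for idx, tokens in enumerate(requests):
        buckets[tuple(tokens[:min_prefix_len])].append(idx)

    groups: List[Tuple[int, List[int]]] = []
    for members in buckets.values():
        if len(members) == 1:
            groups.append((0, members))
            continue
        first = requests[members[0]]
        # The prefix shared by the whole group is the minimum pairwise
        # overlap with the first member.
        shared = min(common_prefix_len(first, requests[i]) for i in members[1:])
        groups.append((shared, members))
    return groups


if __name__ == "__main__":
    token_ids = [[1, 2, 3, 9], [1, 2, 3, 7, 7], [5, 6], [1, 2, 3, 9, 4]]
    for shared, members in group_by_shared_prefix(token_ids, min_prefix_len=2):
        print(f"requests {members} share {shared} leading tokens")
```

Because the whole batch is visible up front, a pass like this can run once before scheduling. A streaming engine, by contrast, only discovers overlap as requests arrive.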

What carries the argument

Global prefix identification combined with request reordering and memory-centric token batching, which together maximize KV reuse and keep the GPU saturated during batched prefill-decode mixes.
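The interaction between reordering and memory-aware admission can be sketched as follows. This is a simplified model under stated assumptions: the `Group` record, the estimated `decode_tokens` field, the `kv_budget_tokens` limit, and the greedy admission rule are all hypothetical, and the paper's actual scheduler is likely more involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Group:
    """A set of requests sharing one prefix (hypothetical record)."""
    name: str
    prefill_tokens: int
    decode_tokens: int  # assumed to be estimated ahead of time

    @property
    def decode_ratio(self) -> float:
        total = self.prefill_tokens + self.decode_tokens
        return self.decode_tokens / total if total else 0.0


def reorder_by_decode_ratio(groups: List[Group]) -> List[Group]:
    """Schedule decode-heavy groups first, prefill-heavy groups later."""
    return sorted(groups, key=lambda g: g.decode_ratio, reverse=True)


def admit_under_kv_budget(groups: List[Group], kv_budget_tokens: int) -> List[Group]:
    """Greedily admit groups, in order, while their KV footprint fits.

    The footprint is approximated as one KV slot per prefill or decode
    token, which ignores prefix reuse inside a group.
    """
    admitted: List[Group] = []
    used = 0
    for g in groups:
        need = g.prefill_tokens + g.decode_tokens
        if used + need > kv_budget_tokens:
            break
        admitted.append(g)
        used += need
    return admitted


if __name__ == "__main__":
    pending = [
        Group("A", prefill_tokens=1000, decode_tokens=100),
        Group("B", prefill_tokens=200, decode_tokens=200),
        Group("C", prefill_tokens=500, decode_tokens=100),
    ]
    ordered = reorder_by_decode_ratio(pending)
    print([g.name for g in ordered])
    print([g.name for g in admit_under_kv_budget(ordered, kv_budget_tokens=1100)])
```

The point of the sketch is that the admission limit is expressed in KV memory rather than in a fixed request count, so the batch can grow until memory, not an arbitrary cap, becomes the constraint.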

If this is right

  • KV contexts for shared prefixes remain resident and are reused across grouped requests instead of being evicted by LRU policy.
  • Decoding tokens are interleaved with prefill chunks in an order that improves GPU occupancy throughout the batch.
  • Larger effective token batches raise arithmetic intensity and overall tokens processed per second.
  • The measured speedups apply across different hardware setups for any workload that exhibits the described prefix-sharing pattern.
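The first consequence can be made concrete with a toy cache model. The sketch below counts how often a prefix's KV context must be recomputed under an LRU cache for two request orders. The function name `prefix_prefills`, the slot-based cache, and the three-prefix workload are illustrative assumptions, not measurements from the paper.

```python
from collections import OrderedDict
from typing import Iterable


def prefix_prefills(order: Iterable[str], cache_slots: int) -> int:
    """Count prefix (re)computations under an LRU cache of `cache_slots` entries."""
    cache: "OrderedDict[str, bool]" = OrderedDict()
    misses = 0
    for prefix_id in order:
        if prefix_id in cache:
            cache.move_to_end(prefix_id)  # mark as most recently used
        else:
            misses += 1
            cache[prefix_id] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict least recently used
    return misses


if __name__ == "__main__":
    interleaved = ["A", "B", "C"] * 3   # arrival order mixes prefixes
    grouped = sorted(interleaved)       # same requests, grouped by prefix
    print(prefix_prefills(interleaved, cache_slots=2))  # 9
    print(prefix_prefills(grouped, cache_slots=2))      # 3
```

With room for only two prefixes, the interleaved order evicts each prefix just before it is needed again, while the grouped order computes each prefix once.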

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same global-prefix and reordering logic could be layered onto other inference engines that currently rely on implicit caching.
  • If prefix detection overhead scales sub-linearly with batch size, the technique would become attractive even for moderately dynamic request streams.
  • Workloads dominated by unique prefixes would see the gains shrink, suggesting a hybrid mode that falls back to conventional scheduling when sharing is low.

Load-bearing premise

The target workloads contain enough prefix sharing and the cost of global identification plus reordering stays low enough that net throughput still rises.
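A back-of-envelope model helps show how strongly this premise depends on the workload. The sketch below estimates the fraction of prefill tokens that prefix reuse could avoid, using the m/n and sharing-degree vocabulary from the Figure 5 caption. The formula and the function name `prefill_token_saving_ratio` are assumptions for illustration; the paper's own definition of its token saving ratio is not reproduced on this page, and the model ignores identification overhead entirely.

```python
def prefill_token_saving_ratio(
    prefix_len: int, suffix_len: int, sharing_degree: int
) -> float:
    """Fraction of prefill tokens avoided when one prefix is computed once.

    prefix_len     : tokens in the shared prefix (m)
    suffix_len     : tokens unique to each request (n)
    sharing_degree : number of requests sharing the prefix (sd)
    """
    if sharing_degree < 1:
        raise ValueError("sharing_degree must be at least 1")
    total = sharing_degree * (prefix_len + suffix_len)
    if total == 0:
        return 0.0
    reused = (sharing_degree - 1) * prefix_len
    return reused / total


if __name__ == "__main__":
    # 2000/200 is the example setting named in the Figure 5 caption;
    # the sharing degrees below are hypothetical.
    for sd in (1, 2, 4, 16):
        ratio = prefill_token_saving_ratio(2000, 200, sd)
        print(f"sd={sd:>2}: {ratio:.3f}")
```

With a sharing degree of 1 the ratio is zero, which matches the premise: without overlap there is nothing for global identification to recover, and any overhead is pure cost.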

What would settle it

A controlled run on a workload engineered to have zero prefix sharing in which BatchLLM shows throughput equal to or below vLLM or SGLang.

Figures

Figures reproduced from arXiv: 2412.03594 by Chuanjie Liu, Fanghao Zhou, Gang Peng, Taosong Fang, Xin Ji, Zhen Zheng.

Figure 2: The token number in the batch processed at each …

Figure 3: BatchLLM overview. Excerpt from the surrounding text: "… also reorder the groups according to the ratio of prefill length and postpone the groups with longer prefill. It then forms the token-batches with the consideration of the KV memory usage. This aims to better mix the decoding steps with the prefill chunks to increase the overall token-batch size. The throughput-oriented scheduling and token-batching optimization is described in Sec.4.3. …"

Figure 4: The preprocessing to maximize the first level prefix …

Figure 5: Microbenchmark evaluation. The setting m/n (like 2000/200) indicates the length of shared prefix/non-shared context, sd means sharing degree. The vLLM setting with '+ p' ('+ c') means prefix-caching (chunked-prefill) enabled. Excerpt from the surrounding text: "… token-batch size when enabling the chunked-prefill of vLLM baseline to maximize its throughput. For the kernel comparison in Sec.6.3.3, we compare BatchLLM with Cascade-Inference [41…"

Figure 6: Microbenchmark evaluation of different sharing …

Figure 8: The token saving (or reusing) ratio of different …

Figure 9: Token number per iteration with token-batching …

Figure 10: The performance comparison between the base …
Original abstract

Large language models (LLMs) increasingly play an important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, and the performance indicator for which is throughput. These tasks usually show the characteristic of prefix sharing, where different prompt input can partially show the common prefix. However, the existing LLM inference engines tend to optimize the streaming requests and show limitations of supporting the large batched tasks with the prefix sharing characteristic. The existing solutions use the LRU-based cache to reuse the KV context of common prefix between requests. The KV context that are about to be reused may be prematurely evicted with the implicit cache management. Besides, the streaming oriented systems do not leverage the request-batch information and can not mix the decoding tokens with the prefill chunks to the best for the batched scenarios, and thus fails to saturate the GPU. We propose BatchLLM to address the above problems. BatchLLM explicitly identifies the common prefixes globally. The requests sharing the same prefix will be scheduled together to reuse the KV context the best. BatchLLM reorders the requests and schedules the requests with larger ratio of decoding first to better mix the decoding tokens with the latter prefill chunks, and applies memory-centric token batching to enlarge the token-batch sizes, which helps to increase the GPU utilization. Extensive evaluation shows that BatchLLM outperforms vLLM and SGLang by $1.3\times$ to $10.8\times$ on a set of microbenchmarks and a typical industry workload under different hardware environments. Code is available at https://github.com/microsoft/MixLLM/tree/batchllm_vllm_064.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes BatchLLM, an inference engine for large batched LLM workloads that exhibit prefix sharing. It explicitly identifies common prefixes globally across requests (rather than relying on LRU caches), reorders requests to prioritize those with higher decoding ratios for better mixing of decode and prefill, and applies memory-centric token batching to enlarge token batches and raise GPU utilization. The central empirical claim is that these techniques yield 1.3×–10.8× throughput gains over vLLM and SGLang on microbenchmarks and a typical industry workload across hardware setups.

Significance. If the throughput claims hold after verification of workloads, baselines, and overheads, the work would be significant for offline/batched inference scenarios common in industry, where prefix sharing is frequent and streaming-oriented engines underperform. The open release of code supports reproducibility and is a clear strength.

major comments (2)
  1. [Abstract] Abstract: the central claim of 1.3×–10.8× speedups over vLLM and SGLang is presented without any workload characteristics (e.g., average prefix length or sharing ratio), hardware details, number of runs, error bars, or ablation results. This directly undermines assessment of whether the measured gains are load-bearing or attributable to unstated baseline differences.
  2. [Abstract] Abstract: the throughput-oriented claims rest on the assumption that global prefix identification plus reordering incurs low enough CPU/GPU overhead relative to KV-reuse savings. No algorithm (trie, hash, etc.), complexity bound, or breakdown of identification time versus inference time is supplied, leaving open the possibility that net gains become negative on hardware where baselines already saturate the GPU.
minor comments (1)
  1. [Abstract] The GitHub link is provided, which aids reproducibility; however, the abstract would benefit from a one-sentence pointer to the evaluation section for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The comments highlight important areas where the abstract can be strengthened to better support the central claims. We address each point below and will incorporate revisions accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of 1.3×–10.8× speedups over vLLM and SGLang is presented without any workload characteristics (e.g., average prefix length or sharing ratio), hardware details, number of runs, error bars, or ablation results. This directly undermines assessment of whether the measured gains are load-bearing or attributable to unstated baseline differences.

    Authors: We agree that the abstract would benefit from additional context on workloads and experimental methodology to allow readers to properly evaluate the reported speedups. In the revised version, we will expand the abstract to report representative workload statistics (including average prefix length and sharing ratio drawn from the industry workload), specify the hardware platforms used, indicate that throughput numbers are averaged over multiple runs, and reference the ablation studies already present in the evaluation section. These changes will make the claims more transparent without altering the underlying results. revision: yes

  2. Referee: [Abstract] Abstract: the throughput-oriented claims rest on the assumption that global prefix identification plus reordering incurs low enough CPU/GPU overhead relative to KV-reuse savings. No algorithm (trie, hash, etc.), complexity bound, or breakdown of identification time versus inference time is supplied, leaving open the possibility that net gains become negative on hardware where baselines already saturate the GPU.

    Authors: We acknowledge the validity of this concern: the current abstract does not describe the prefix identification method or quantify its overhead relative to inference time. We will revise the abstract to include a concise description of the identification approach, an asymptotic complexity bound, and a statement supported by new empirical measurements showing that identification time constitutes only a small fraction of total inference time across the evaluated hardware. These additions will directly address the possibility of negative net gains and will be backed by data added to the main text if needed. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external system comparisons

The paper describes an engineering optimization for batched LLM inference (global prefix identification, request reordering, memory-centric token batching) and supports its claims solely via direct runtime measurements against independent external baselines (vLLM, SGLang) on microbenchmarks and an industry workload. No equations, fitted parameters, self-citations, or internal definitions are used to derive the reported speedups; the 1.3×–10.8× figures are presented as measured outcomes rather than constructed from the method itself. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard domain assumptions about LLM inference phases and GPU behavior rather than new fitted parameters or invented entities.

axioms (2)
  • domain assumption LLM inference consists of prefill and decode phases that can be mixed in batched execution
    Invoked when describing mixing decoding tokens with prefill chunks.
  • domain assumption Larger token batch sizes increase GPU utilization
    Basis for the memory-centric token batching strategy.
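The second axiom can be checked with simple arithmetic for a single dense layer. The sketch below computes FLOPs per byte moved for a matrix multiply applied to a token batch. The function name, the 4096-wide layer, and the 2-byte element size are illustrative assumptions, and the model ignores attention, caching effects, and kernel efficiency.

```python
def linear_layer_arithmetic_intensity(
    tokens: int, d_in: int, d_out: int, bytes_per_elem: int = 2
) -> float:
    """FLOPs per byte for y = x @ W with x of shape (tokens, d_in).

    FLOPs : 2 * tokens * d_in * d_out (one multiply and one add per term)
    Bytes : weights are read once, activations are read and written once
    """
    flops = 2 * tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out + tokens * d_in + tokens * d_out)
    return flops / bytes_moved


if __name__ == "__main__":
    for t in (1, 16, 128, 512, 2048):
        ai = linear_layer_arithmetic_intensity(t, d_in=4096, d_out=4096)
        print(f"token batch {t:>4}: {ai:8.1f} FLOPs/byte")
```

At a token batch of 1 the layer performs roughly one FLOP per byte, so it is bound by weight traffic. As the token batch grows, the weight read is amortized over more tokens and the intensity rises, which is why enlarging token batches tends to improve utilization until the layer becomes compute-bound.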

pith-pipeline@v0.9.0 · 5868 in / 1289 out tokens · 46565 ms · 2026-05-23T16:52:37.945122+00:00 · methodology

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

    cs.LG 2026-05 unverdicted novelty 7.0

    MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.

  2. Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference

    cs.LG 2026-05 unverdicted novelty 6.0

    Feather uses reinforcement learning and a Chunked Hash Tree to balance batch size against prefix homogeneity in LLM inference, delivering 2-10x higher throughput than existing schedulers.

  3. MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

    cs.LG 2026-05 unverdicted novelty 6.0

    ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.

  4. AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System

    cs.DC 2026-05 unverdicted novelty 5.0

    AlignedServe uses prefix-aware batching, large CPU in-flight request pools, batch scheduling, and GPU-to-GPU KV prefetching to raise decoding throughput up to 1.98x and cut latency up to 7.4x versus prior serving systems.

  5. PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers

    cs.DC 2026-05 unverdicted novelty 5.0

    PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.

  6. MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

    cs.LG 2024-12 unverdicted novelty 5.0

    MixLLM uses global output-feature importance to set mixed bit-widths for LLM quantization and adds two-step dequantization plus software pipelining for system efficiency.

  7. Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

    cs.LG 2026-03 unverdicted novelty 2.0

    The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 6 Pith papers · 5 internal anchors

  1. [1]

    SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. CoRR abs/2308.16369 (2023)

  2. [2]

    AI@Meta. 2024. Llama 3 Model Card. (2024). https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

  3. [3]

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. 2024. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net. https://openreview.net/forum?id=PEpbUobfJv

  4. [4]

    Zui Chen, Lei Cao, and Sam Madden. 2023. Lingua Manga: A Generic Large Language Model Centric System for Data Curation. Proc. VLDB Endow. 16, 12 (Aug. 2023), 4074–4077. https://doi.org/10.14778/3611540.3611624

  5. [5]

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022

  6. [6]

    Raul Castro Fernandez, Aaron J. Elmore, Michael J. Franklin, Sanjay Krishnan, and Chenhao Tan. 2023. How Large Language Models Will Disrupt Data Management. Proc. VLDB Endow. 16, 11 (July 2023), 3302–3309. https://doi.org/10.14778/3611479.3611527

  7. [7]

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. CoRR abs/2210.17323 (2022)

  8. [8]

    Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, and Hao Zhang. 2024. Efficient LLM Scheduling by Learning to Rank. CoRR abs/2408.15792 (2024)

  9. [9]

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. In Proceedings of the 2024 USENIX Annual Technical Conference, USENIX ATC 2024, Santa Clara, CA, USA, July 10-12, 2024 . USENIX Associatio...

  10. [10]

    Pin Gao, Lingfan Yu, Yongwei Wu, and Jinyang Li. 2018. Low latency RNN inference with cellular batching. In Proceedings of the Thirteenth EuroSys Conference. Association for Computing Machinery, Article 31, 15 pages. https://doi.org/10.1145/3190508.3190541

  11. [11]

    In Gim, Guojun Chen, Seung-Seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. In Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024 . mlsys.org. https://proceedings.mlsys.org/paper_files/paper/2024/hash...

  12. [12]

    Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. 2024. DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. CoRR abs/2401.08671 (2024)

  13. [13]

    Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, and Yizhou Shan. 2024. MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool. CoRR abs/2406.17565 (2024)

  14. [14]

    Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, and Xin Jin. 2024. RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation. CoRR abs/2404.12457 (2024)

  15. [15]

    Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo,...

  16. [16]

    Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, and Azalia Mirhoseini. 2024. Hydragen: High-Throughput LLM Inference with Shared Prefixes. arXiv:2402.05099 [cs.LG]

  17. [17]

    Sein Kim, Hongseok Kang, Seungyoon Choi, Donghyun Kim, Minchul Yang, and Chanyoung Park. 2024. Large Language Models meet Collaborative Filtering: An Efficient All-round LLM-based Recommender System. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24). Association for Computing Machinery, 1395–1406. https://do...

  18. [18]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023. ACM, 611–626

  19. [19]

    Ao Li, Bojian Zheng, Gennady Pekhimenko, and Fan Long. 2022. Automatic Horizontal Fusion for GPU Kernels. In IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2022, Seoul, Korea, Republic of, April 2-6,

  20. [20]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. In Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 202...

  21. [21]

    Elyas Meguellati, Lei Han, Abraham Bernstein, Shazia Sadiq, and Gianluca Demartini. 2024. How Good are LLMs in Generating Personalized Advertisements?. In Companion Proceedings of the ACM Web Conference 2024 (WWW ’24). Association for Computing Machinery, 826–829. https://doi.org/10.1145/3589335.3651520

  22. [22]

    Maxim Milakov and Natalia Gimelshein. 2018. Online normalizer calculation for softmax. CoRR abs/1805.02867 (2018). arXiv:1805.02867 http://arxiv.org/abs/1805.02867

  23. [23]

    OpenAI. Cited 2024. Introducing OpenAI o1-preview. https://openai.com/index/introducing-openai-o1-preview/

  24. [24]

    OpenAI. Cited 2024. OpenAI Prompt Caching. https://platform.openai.com/docs/guides/prompt-caching

  25. [25]

    Zaifeng Pan, Zhen Zheng, Feng Zhang, Ruofan Wu, Hao Liang, Dalin Wang, Xiafei Qiu, Junjie Bai, Wei Lin, and Xiaoyong Du. 2023. RECom: A Compiler Approach to Accelerating Recommendation Model Inference with Massive Embedding Columns. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating S...

  26. [26]

    Zaifeng Pan, Zhen Zheng, Feng Zhang, Bing Xie, Ruofan Wu, Shaden Smith, Chuanjie Liu, Olatunji Ruwase, Xiaoyong Du, and Yufei Ding. 2024. RecFlex: Enabling Feature Heterogeneity-Aware Optimization for Deep Recommendation Models with Flexible Schedules. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and ...

  27. [27]

    Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2024. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. CoRR abs/2407.00079 (2024)

  28. [28]

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  29. [29]

    Markus N. Rabe and Charles Staats. 2021. Self-attention Does Not Need $O(n^2)$ Memory. CoRR abs/2112.05682 (2021). arXiv:2112.05682 https://arxiv.org/abs/2112.05682

  30. [30]

    Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, and Ion Stoica. 2024. Fairness in Serving Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024. USENIX Association, 965–988

  31. [31]

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Proceedings of Machine Lear...

  32. [32]

    TensorRT-LLM team. Cited 2024. TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM

  33. [33]

    Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2019). Association for Computing Machinery, 10–19

  34. [34]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxbur...

  35. [35]

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Large Search Model: Redefining Search Stack in the Era of LLMs. SIGIR Forum 57, 2 (2023), 23:1–23:16

  36. [36]

    Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, and Shuaiwen Leon Song. 2023. Flash-LLM: Enabling Low-Cost and Highly-Efficient Large Generative Model Inference With Unstructured Sparsity. Proc. VLDB Endow. 17, 2 (2023), 211–224

  37. [37]

    Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, and Shuaiwen Leon Song. 2024. Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs. In Proceedings of the 2024 USENIX Annu...

  38. [38]

    Haoyi Xiong, Jiang Bian, Yuchen Li, Xuhong Li, Mengnan Du, Shuaiqiang Wang, Dawei Yin, and Sumi Helal. 2024. When Search Engine Services meet Large Language Models: Visions and Challenges. CoRR abs/2407.00128 (2024)

  39. [39]

    Lu Ye, Ze Tao, Yong Huang, and Yang Li. 2024. ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024. Association for Computational Linguistics, 11608–11620

  40. [40]

    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze

  41. [41]

    FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

    FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. arXiv:2501.01005 [cs.DC] https://arxiv.org/abs/2501.01005

  42. [42]

    Zihao Ye, Ruihang Lai, Bo-Ru Lu, Chien-Yu Lin, Size Zheng, Lequn Chen, Tianqi Chen, and Luis Ceze. 2024. Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding. https://flashinfer.ai/2024/02/02/cascade-inference.html

  43. [43]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022. USENIX Association, 521–538

  44. [44]

    Zihuai Zhao, Wenqi Fan, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Zhen Wen, Fei Wang, Xiangyu Zhao, Jiliang Tang, and Qing Li. 2024. Recommender Systems in the Era of Large Language Models (LLMs). IEEE Transactions on Knowledge and Data Engineering 36, 11 (2024), 6889–6907. https://doi.org/10.1109/TKDE.2024.3392335

  45. [45]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2023. Efficiently Programming Large Language Models using SGLang. (2023)

  46. [46]

    Lei Zhu, Xinjiang Wang, Wayne Zhang, and Rynson W. H. Lau. 2024. RelayAttention for Efficient Large Language Model Serving with Long System Prompts. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16,

  47. [47]

    Association for Computational Linguistics, 4945–4957