BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching
Pith reviewed 2026-05-23 16:52 UTC · model grok-4.3
The pith
BatchLLM achieves 1.3× to 10.8× higher throughput for large batched LLM inference by identifying shared prefixes globally and reordering token batches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BatchLLM explicitly identifies common prefixes globally so requests that share the same prefix are scheduled together to reuse KV context without premature eviction. It reorders requests to place those with a larger ratio of decoding tokens first, mixes decoding with later prefill chunks, and uses memory-centric token batching to increase token-batch sizes and GPU utilization. On this basis the system delivers 1.3× to 10.8× higher throughput than vLLM and SGLang across microbenchmarks and a representative industry workload on varied hardware.
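The reordering step can be sketched as a sort on each request's decode ratio. This is an illustrative sketch, not the paper's implementation: the `Request` type, the `est_decode_tokens` field, and the assumption that an output-length estimate is available before scheduling are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prefill_tokens: int     # prompt length in tokens
    est_decode_tokens: int  # ASSUMPTION: an estimate of output length is available


def decode_ratio(r: Request) -> float:
    """Fraction of a request's total tokens that are decoding tokens."""
    total = r.prefill_tokens + r.est_decode_tokens
    return r.est_decode_tokens / total if total else 0.0


def reorder_by_decode_ratio(requests):
    """Place requests with a larger decode ratio first (stable sort)."""
    return sorted(requests, key=decode_ratio, reverse=True)


reqs = [
    Request("a", prefill_tokens=900, est_decode_tokens=100),  # ratio 0.1
    Request("b", prefill_tokens=100, est_decode_tokens=400),  # ratio 0.8
    Request("c", prefill_tokens=500, est_decode_tokens=500),  # ratio 0.5
]
order = [r.rid for r in reorder_by_decode_ratio(reqs)]
print(order)  # ['b', 'c', 'a']
```

Starting decode-heavy requests early leaves their decoding steps available to be mixed with the prefill chunks of requests scheduled later.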
What carries the argument
Global prefix identification combined with request reordering and memory-centric token batching, which together maximize KV reuse and keep the GPU saturated during batched prefill-decode mixes.
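A minimal sketch of scheduling requests with the same prefix together, assuming prompts are already tokenized and that sharing can be detected from the first `prefix_len` tokens. Both the fixed-length key and the function names are simplifications introduced here; the page does not describe the paper's actual identification algorithm.

```python
from collections import defaultdict


def group_by_prefix(prompts, prefix_len):
    """Map each distinct leading token window to the indices of prompts that start with it."""
    groups = defaultdict(list)
    for idx, tokens in enumerate(prompts):
        groups[tuple(tokens[:prefix_len])].append(idx)
    return dict(groups)


def schedule(prompts, prefix_len):
    """Emit request indices so that each prefix group is contiguous, largest group first."""
    groups = group_by_prefix(prompts, prefix_len)
    ordered_groups = sorted(groups.values(), key=len, reverse=True)
    return [idx for group in ordered_groups for idx in group]


prompts = [
    [1, 2, 3, 9],
    [7, 8, 9, 1],
    [1, 2, 3, 4],
    [7, 8, 9, 2],
    [1, 2, 3, 5],
]
plan = schedule(prompts, prefix_len=3)
print(plan)  # [0, 2, 4, 1, 3]
```

Because every group is contiguous in the plan, a prefix's KV context only has to stay resident while its own group runs.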
If this is right
- KV contexts for shared prefixes remain resident and are reused across grouped requests instead of being evicted by LRU policy.
- Decoding tokens are interleaved with prefill chunks in an order that improves GPU occupancy throughout the batch.
- Larger effective token batches raise arithmetic intensity and overall tokens processed per second.
- The measured speedups apply across different hardware setups for any workload that exhibits the described prefix-sharing pattern.
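The eviction effect in the first point can be illustrated with a toy LRU cache keyed by prefix. This is a simulation sketch only: the cache capacity, the prefix labels, and the `lru_hits` helper are hypothetical and do not model any engine's real KV cache.

```python
from collections import OrderedDict


def lru_hits(prefix_sequence, capacity):
    """Count how many requests find their prefix still resident in an LRU cache."""
    cache = OrderedDict()
    hits = 0
    for prefix in prefix_sequence:
        if prefix in cache:
            hits += 1
            cache.move_to_end(prefix)      # mark as most recently used
        else:
            cache[prefix] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used prefix
    return hits


arrival = ["A", "B", "C", "A", "B", "C"]  # interleaved arrival order
grouped = sorted(arrival)                  # A A B B C C

print(lru_hits(arrival, capacity=2))  # 0
print(lru_hits(grouped, capacity=2))  # 3
```

With room for two prefixes, the interleaved order evicts each prefix just before it is needed again, while the grouped order reuses every prefix once.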
Where Pith is reading between the lines
- The same global-prefix and reordering logic could be layered onto other inference engines that currently rely on implicit caching.
- If prefix detection overhead scales sub-linearly with batch size, the technique would become attractive even for moderately dynamic request streams.
- Workloads dominated by unique prefixes would see the gains shrink, suggesting a hybrid mode that falls back to conventional scheduling when sharing is low.
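The hybrid mode suggested in the last point could be gated on a measured sharing ratio. The sketch below is an assumption-laden illustration: the ratio definition (tokens shared with the lexicographic predecessor, divided by all prompt tokens), the 0.2 threshold, and the mode names are all hypothetical.

```python
def lcp(a, b):
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def sharing_ratio(prompts):
    """Fraction of prompt tokens shared with the preceding prompt in sorted order."""
    ordered = sorted(prompts)
    total = sum(len(p) for p in ordered)
    shared = sum(lcp(a, b) for a, b in zip(ordered, ordered[1:]))
    return shared / total if total else 0.0


def choose_mode(prompts, threshold=0.2):
    """ASSUMPTION: fall back to conventional scheduling below a fixed threshold."""
    return "prefix-grouped" if sharing_ratio(prompts) >= threshold else "conventional"


shared = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 7]]
unique = [[1, 2], [3, 4], [5, 6]]

print(choose_mode(shared))  # prefix-grouped
print(choose_mode(unique))  # conventional
```

Sorting first means each prompt is compared only with its neighbor, where its longest match with any other prompt is found.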
Load-bearing premise
The target workloads contain enough prefix sharing and the cost of global identification plus reordering stays low enough that net throughput still rises.
What would settle it
A controlled run on a workload engineered to have zero prefix sharing in which BatchLLM shows throughput equal to or below vLLM or SGLang.
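A zero-sharing workload for such a run could be generated as below. This is a hedged sketch: the function name, the parameters, and the use of random token IDs in place of real text are assumptions made for illustration.

```python
import random


def zero_sharing_workload(num_requests, prompt_len, vocab_size, seed=0):
    """Build token-ID prompts whose first tokens are all distinct, so no two share a prefix."""
    if prompt_len < 1:
        raise ValueError("prompt_len must be at least 1")
    if num_requests > vocab_size:
        raise ValueError("need at least one distinct first token per request")
    rng = random.Random(seed)
    first_tokens = rng.sample(range(vocab_size), num_requests)  # sampled without replacement
    return [
        [first] + [rng.randrange(vocab_size) for _ in range(prompt_len - 1)]
        for first in first_tokens
    ]


workload = zero_sharing_workload(num_requests=50, prompt_len=8, vocab_size=1000)
```

Many tokenizers prepend a shared beginning-of-sequence token, so a real harness may need to treat that single token as unavoidable sharing.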
Abstract
Large language models (LLMs) increasingly play an important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, and their key performance indicator is throughput. These tasks usually exhibit prefix sharing, where different prompt inputs partially share a common prefix. However, existing LLM inference engines tend to optimize for streaming requests and show limitations in supporting large batched tasks with the prefix-sharing characteristic. Existing solutions use an LRU-based cache to reuse the KV context of common prefixes between requests. With this implicit cache management, KV context that is about to be reused may be prematurely evicted. Besides, streaming-oriented systems do not leverage request-batch information and cannot optimally mix decoding tokens with prefill chunks in batched scenarios, and thus fail to saturate the GPU. We propose BatchLLM to address the above problems. BatchLLM explicitly identifies the common prefixes globally. Requests sharing the same prefix are scheduled together to best reuse the KV context. BatchLLM reorders the requests, scheduling those with a larger ratio of decoding first to better mix the decoding tokens with the later prefill chunks, and applies memory-centric token batching to enlarge the token-batch sizes, which helps to increase GPU utilization. Extensive evaluation shows that BatchLLM outperforms vLLM and SGLang by 1.3× to 10.8× on a set of microbenchmarks and a typical industry workload under different hardware environments. Code is available at https://github.com/microsoft/MixLLM/tree/batchllm_vllm_064.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes BatchLLM, an inference engine for large batched LLM workloads that exhibit prefix sharing. It explicitly identifies common prefixes globally across requests (rather than relying on LRU caches), reorders requests to prioritize those with higher decoding ratios for better mixing of decode and prefill, and applies memory-centric token batching to enlarge token batches and raise GPU utilization. The central empirical claim is that these techniques yield 1.3×–10.8× throughput gains over vLLM and SGLang on microbenchmarks and a typical industry workload across hardware setups.
Significance. If the throughput claims hold after verification of workloads, baselines, and overheads, the work would be significant for offline/batched inference scenarios common in industry, where prefix sharing is frequent and streaming-oriented engines underperform. The open release of code supports reproducibility and is a clear strength.
major comments (2)
- [Abstract] The central claim of 1.3×–10.8× speedups over vLLM and SGLang is presented without any workload characteristics (e.g., average prefix length or sharing ratio), hardware details, number of runs, error bars, or ablation results. This directly undermines assessment of whether the measured gains are load-bearing or attributable to unstated baseline differences.
- [Abstract] The throughput-oriented claims rest on the assumption that global prefix identification plus reordering incurs low enough CPU/GPU overhead relative to KV-reuse savings. No algorithm (trie, hash, etc.), complexity bound, or breakdown of identification time versus inference time is supplied, leaving open the possibility that net gains become negative on hardware where baselines already saturate the GPU.
minor comments (1)
- [Abstract] The GitHub link is provided, which aids reproducibility; however, the abstract would benefit from a one-sentence pointer to the evaluation section for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. The comments highlight important areas where the abstract can be strengthened to better support the central claims. We address each point below and will incorporate revisions accordingly.
- Referee: [Abstract] The central claim of 1.3×–10.8× speedups over vLLM and SGLang is presented without any workload characteristics (e.g., average prefix length or sharing ratio), hardware details, number of runs, error bars, or ablation results. This directly undermines assessment of whether the measured gains are load-bearing or attributable to unstated baseline differences.
  Authors: We agree that the abstract would benefit from additional context on workloads and experimental methodology to allow readers to properly evaluate the reported speedups. In the revised version, we will expand the abstract to report representative workload statistics (including average prefix length and sharing ratio drawn from the industry workload), specify the hardware platforms used, indicate that throughput numbers are averaged over multiple runs, and reference the ablation studies already present in the evaluation section. These changes will make the claims more transparent without altering the underlying results.
  Revision: yes
- Referee: [Abstract] The throughput-oriented claims rest on the assumption that global prefix identification plus reordering incurs low enough CPU/GPU overhead relative to KV-reuse savings. No algorithm (trie, hash, etc.), complexity bound, or breakdown of identification time versus inference time is supplied, leaving open the possibility that net gains become negative on hardware where baselines already saturate the GPU.
  Authors: We acknowledge the validity of this concern: the current abstract does not describe the prefix identification method or quantify its overhead relative to inference time. We will revise the abstract to include a concise description of the identification approach, an asymptotic complexity bound, and a statement supported by new empirical measurements showing that identification time constitutes only a small fraction of total inference time across the evaluated hardware. These additions will directly address the possibility of negative net gains and will be backed by data added to the main text if needed.
  Revision: yes
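For a sense of the overhead in question, here is a sketch of trie-based prefix identification, one of the options the referee names. It is not taken from the paper: the `TrieNode` class and helper names are hypothetical, and whether BatchLLM uses a trie at all is not stated on this page. Building the trie takes one dictionary operation per prompt token, so the cost grows linearly with the total number of tokens.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # token ID -> TrieNode
        self.count = 0      # number of prompts whose path passes through this node


def build_trie(prompts):
    """Insert every prompt; one dict lookup per token, so linear in total tokens."""
    root = TrieNode()
    for tokens in prompts:
        node = root
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                child = node.children[t] = TrieNode()
            child.count += 1
            node = child
    return root


def shared_prefix_len(root, tokens):
    """Length of the longest prefix of an inserted prompt that another prompt also has."""
    node, n = root, 0
    for t in tokens:
        node = node.children[t]
        if node.count < 2:
            break
        n += 1
    return n


prompts = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 7], [9, 9]]
root = build_trie(prompts)
lengths = [shared_prefix_len(root, p) for p in prompts]
print(lengths)  # [3, 3, 2, 0]
```

A time breakdown like the one the referee requests would compare the wall-clock cost of `build_trie` against end-to-end inference time on the same batch.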
Circularity Check
No circularity; empirical claims rest on external system comparisons
The paper describes an engineering optimization for batched LLM inference (global prefix identification, request reordering, memory-centric token batching) and supports its claims solely via direct runtime measurements against independent external baselines (vLLM, SGLang) on microbenchmarks and an industry workload. No equations, fitted parameters, self-citations, or internal definitions are used to derive the reported speedups; the 1.3×–10.8× figures are presented as measured outcomes rather than constructed from the method itself. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM inference consists of prefill and decode phases that can be mixed in batched execution
- domain assumption: Larger token batch sizes increase GPU utilization
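The second assumption is what memory-centric token batching relies on. The sketch below shows one plausible reading of that idea, in which a batch is bounded by a KV-memory budget measured in tokens rather than by a fixed request count. The function names, the budget unit, and the greedy policy are assumptions and are not taken from the paper.

```python
def fixed_count_batches(token_counts, max_requests):
    """Baseline: cap each batch by number of requests, regardless of size."""
    return [
        token_counts[i:i + max_requests]
        for i in range(0, len(token_counts), max_requests)
    ]


def memory_budget_batches(token_counts, kv_token_budget):
    """Greedily fill each batch until the KV budget (in tokens) would be exceeded."""
    batches, current, used = [], [], 0
    for n in token_counts:
        if n > kv_token_budget:
            raise ValueError("a single request exceeds the KV budget")
        if current and used + n > kv_token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        batches.append(current)
    return batches


token_counts = [100] * 8  # eight short requests of 100 tokens each

by_count = fixed_count_batches(token_counts, max_requests=2)
by_memory = memory_budget_batches(token_counts, kv_token_budget=800)

print([sum(b) for b in by_count])   # [200, 200, 200, 200]
print([sum(b) for b in by_memory])  # [800]
```

With short requests, the request-count cap yields four small token batches, while the memory budget admits all eight requests into one larger batch.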
Forward citations
Cited by 7 Pith papers
- MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
  MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
- Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference
  Feather uses reinforcement learning and a Chunked Hash Tree to balance batch size against prefix homogeneity in LLM inference, delivering 2-10x higher throughput than existing schedulers.
- MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
  ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
- AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System
  AlignedServe uses prefix-aware batching, large CPU in-flight request pools, batch scheduling, and GPU-to-GPU KV prefetching to raise decoding throughput up to 1.98x and cut latency up to 7.4x versus prior serving systems.
- PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers
  PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.
- MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
  MixLLM uses global output-feature importance to set mixed bit-widths for LLM quantization and adds two-step dequantization plus software pipelining for system efficiency.
- Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies
  The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.