
arxiv: 2508.15919 · v3 · submitted 2025-08-21 · 💻 cs.DC · cs.AI

HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

Pith reviewed 2026-05-18 21:35 UTC · model grok-4.3

keywords LLM serving · SLO compliance · elastic scaling · request scheduling · multi-task workloads · NPU efficiency · D2D transfer

The pith

HFX jointly optimizes LLM request scheduling and replica scaling to meet diverse SLOs while cutting latency and hardware costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that existing LLM serving methods fall short when facing mixed requests with different performance targets and changing loads. It introduces a scheduler that estimates resource budgets ahead of time and ranks requests to protect both new arrivals and those already running. A paired scaler moves model weights quickly between devices to start new copies with little delay and works with either combined or separated prefill and decode stages. Experiments on varied workloads report gains in meeting targets, faster overall response times, and reduced processor usage. A reader would care because production systems must balance strict guarantees for users against the expense of running large models at scale.

Core claim

HFX is a production LLM serving system that jointly optimizes request scheduling and elastic scaling across model replicas to satisfy diverse SLOs. The scheduler performs proactive budget estimation and prioritization to ensure compliance for both new and in-flight requests. The scaler supports fast device-to-device weight transfer to reduce cold-start latency and accommodates both colocated and disaggregated prefill/decode deployments.
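The summary does not spell out how the budget is estimated or how requests are ranked, so the following is only a minimal sketch of one common way to do it: order requests by slack, meaning the time left before the deadline minus the estimated remaining work. The `Request` fields, the `slack` formula, and the `prioritize` helper are all hypothetical names chosen for illustration, not the paper's scheduler.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    arrival_s: float      # arrival time in seconds
    deadline_s: float     # end-to-end latency target (the SLO), in seconds
    est_service_s: float  # assumed estimate of the remaining service time


def slack(req: Request, now_s: float) -> float:
    """Budget left once the estimated remaining work is subtracted."""
    absolute_deadline = req.arrival_s + req.deadline_s
    return absolute_deadline - now_s - req.est_service_s


def prioritize(requests: list[Request], now_s: float) -> list[Request]:
    """Return requests ordered by ascending slack (tightest budget first).

    New and in-flight requests are treated alike: an in-flight request
    simply carries a smaller est_service_s as it makes progress.
    """
    return sorted(requests, key=lambda r: slack(r, now_s))


if __name__ == "__main__":
    queue = [
        Request("a", arrival_s=0.0, deadline_s=5.0, est_service_s=1.0),
        Request("b", arrival_s=0.5, deadline_s=2.0, est_service_s=1.0),
        Request("c", arrival_s=0.9, deadline_s=10.0, est_service_s=2.0),
    ]
    for r in prioritize(queue, now_s=1.0):
        print(r.req_id, round(slack(r, 1.0), 3))
```

A negative slack in this sketch would mean the request is already expected to miss its target, which is the kind of signal a scaler could use to add replicas.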

What carries the argument

The integrated scheduler-scaler that pairs proactive budget estimation with fast D2D weight transfer for SLO-aware multi-replica management.
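To see why the weight-transfer path matters for cold starts, a back-of-envelope model helps: load time is roughly weight size divided by the bandwidth of the path the weights travel. The sketch below assumes a 7-billion-parameter model stored in 16-bit precision, and the two bandwidth figures are purely hypothetical placeholders, not measurements from the paper or from any specific hardware.

```python
def weights_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in bytes."""
    return num_params * bytes_per_param


def transfer_seconds(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized transfer time, ignoring setup cost and contention."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    size = weights_bytes(7e9)  # 7B parameters at 2 bytes each = 14 GB

    # Hypothetical bandwidths, chosen only to illustrate the ratio.
    storage_bw = 2e9   # assumed: loading from storage at 2 GB/s
    d2d_bw = 50e9      # assumed: device-to-device link at 50 GB/s

    print("from storage:", transfer_seconds(size, storage_bw), "s")
    print("device-to-device:", transfer_seconds(size, d2d_bw), "s")
```

Under these assumed numbers the load time scales inversely with bandwidth, so the benefit of a D2D path depends entirely on how much faster the interconnect is than the alternative source.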

If this is right

  • Up to 4.44 times higher SLO attainment than current systems on multi-task workloads
  • Up to 65.82 percent reduction in end-to-end latency
  • Up to 49.81 percent lower NPU usage cost
  • Effective handling of both colocated and disaggregated prefill/decode configurations
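The headline figures mix a ratio (4.44 times) with relative reductions (percent), so it can help to pin down the arithmetic. The sketch below assumes the usual definition of SLO attainment as the fraction of requests that finish within their latency target; the function names and all sample values are illustrative and do not come from the paper.

```python
def slo_attainment(latencies_s: list[float], targets_s: list[float]) -> float:
    """Fraction of requests whose latency is within its own target."""
    if len(latencies_s) != len(targets_s) or not latencies_s:
        raise ValueError("need two non-empty lists of equal length")
    met = sum(1 for lat, tgt in zip(latencies_s, targets_s) if lat <= tgt)
    return met / len(latencies_s)


def relative_reduction(baseline: float, new: float) -> float:
    """Relative drop from baseline to new, e.g. 0.25 means 25% lower."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    # Illustrative values only: three of four requests meet their target.
    print(slo_attainment([1.0, 2.5, 0.8, 4.0], [2.0, 2.0, 1.0, 5.0]))  # 0.75

    # A 65.82% reduction would take a 10.0 s baseline down to 3.418 s.
    print(round(relative_reduction(10.0, 3.418), 4))  # 0.6582
```

Note that "up to" figures report the best case across configurations, so a ratio such as 4.44 times can arise when the baseline's attainment is low in one particular setting.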

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint approach might lower over-provisioning in cloud clusters that run many models at once
  • Similar scheduler-scaler pairings could be tested on other distributed AI workloads beyond LLMs
  • Extending the budget estimation to include network or memory limits might uncover further savings

Load-bearing premise

The multi-task workloads and hardware setups in the tests stand in for real production environments that have mixed requests, changing prompt lengths, and frequent scaling.

What would settle it

A production trace with request patterns or hardware that differ sharply from the test set shows no gain in SLO attainment or cost metrics when HFX is used instead of prior systems.

Figures

Figures reproduced from arXiv: 2508.15919 by Hong Chang, Jiannan Wang, Kan Chen, Morgan Lindsay Heisler, Niloofar Gholipour, Parham Yassini, Qiantao Zhang, Qian Wang, Taha Shabani, Xiaolong Bai, Xinglu Wang, Ying Xiong, Yong Zhang, Zahra Yousefijamarani, Zhenan Fan.

Figure 3: Multi-task performance on 2-task and 4-task workloads. Metrics include SLO attainment (top, higher is better), end-to-end latency (middle, lower is better), and cost (bottom, lower is better) for HyperFlexis, RR, SCORPIO, and HyperFlexis-Scaling. Results use two workers, with up to four for scaling.
read the original abstract

Large language model (LLM) serving faces the dual challenge of meeting strict user-specific service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Existing approaches either rely on static scheduling policies or focus on single-task settings, limiting their applicability in real-world deployments with heterogeneous requests, variable prompt lengths, and elastic scaling requirements. We present HFX, a production LLM serving system that jointly optimizes request scheduling and elastic scaling across model replicas to satisfy diverse SLOs. HFX introduces a scheduler that performs proactive budget estimation and prioritization to ensure SLO compliance for both new and in-flight requests. HFX also integrates a scaler that supports fast device-to-device (D2D) weight transfer, reducing cold-start latency. Additionally, the system supports both colocated and disaggregated prefill/decode deployments, enabling adaptation to diverse workload patterns and cloud environments. Through extensive experiments on multi-task workloads, we demonstrate consistently higher SLO attainment, lower end-to-end latency, and lower NPU usage cost by up to 4.44×, 65.82%, and 49.81%, respectively, compared to state-of-the-art systems. Our results highlight the effectiveness of SLO-aware scheduling and scaling in practical LLM serving, providing a robust framework for cost-efficient and SLO-compliant deployments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents HFX, a production LLM serving system that jointly optimizes request scheduling and elastic scaling to meet diverse user-specific SLOs under dynamic multi-task workloads. The scheduler performs proactive budget estimation and prioritization for both new and in-flight requests, while the scaler enables fast device-to-device (D2D) weight transfer to reduce cold-start latency. The system supports colocated and disaggregated prefill/decode deployments. Experiments on multi-task workloads claim up to 4.44× higher SLO attainment, 65.82% lower end-to-end latency, and 49.81% lower NPU usage cost relative to state-of-the-art systems.

Significance. If the experimental results hold under representative conditions, the work would be significant for distributed systems and LLM serving research. It offers a practical joint algorithm-system design for handling heterogeneous requests and elastic scaling, addressing real deployment challenges in cost-efficient, SLO-compliant serving. The explicit support for both colocated and disaggregated modes adds flexibility for varied cloud environments.

major comments (1)
  1. [§5 (Experiments)] §5 (Experiments): The central claims rest on quantitative experimental results (up to 4.44× SLO attainment, 65.82% latency reduction, 49.81% cost reduction). The abstract and available description provide no details on the specific baselines, workload generation process for multi-task scenarios, hardware configurations, number of runs, or controls for confounds such as prompt length variation. These omissions make it difficult to assess whether the reported gains are load-bearing and reproducible.
minor comments (1)
  1. [Abstract] Abstract: The inline LaTeX '4.44$×$' should be rendered as proper text (4.44×) in the final version for readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment on the experimental section below and commit to revisions that improve the clarity and reproducibility of our results.

  1. Referee: §5 (Experiments): The central claims rest on quantitative experimental results (up to 4.44× SLO attainment, 65.82% latency reduction, 49.81% cost reduction). The abstract and available description provide no details on the specific baselines, workload generation process for multi-task scenarios, hardware configurations, number of runs, or controls for confounds such as prompt length variation. These omissions make it difficult to assess whether the reported gains are load-bearing and reproducible.

    Authors: We agree that the current presentation of experimental details could be strengthened for better reproducibility. In the revised manuscript, we will expand §5 to include: (1) explicit listing of all baselines with version numbers and configuration parameters; (2) a detailed description of the multi-task workload generator, including how request traces are synthesized from production logs with controlled distributions of prompt lengths, task types, and arrival rates; (3) hardware specifications (NPU models, interconnect, cluster sizes) and software stack; (4) the number of independent runs (we will report results from 5 runs with mean and standard deviation); and (5) explicit controls for prompt length variation, such as stratified sampling across length buckets and per-bucket performance breakdowns. These additions will directly address the concerns while preserving the existing experimental methodology. revision: yes
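The reporting plan in the rebuttal (stratified sampling across prompt-length buckets, and mean with standard deviation over repeated runs) can be sketched as follows. This is an illustration only: the bucket edges, the helper names, and the sample values are assumptions, and nothing here reflects the authors' actual workload generator.

```python
import random
import statistics
from collections import defaultdict


def bucket_of(prompt_len: int, edges: tuple[int, ...] = (128, 512, 2048)) -> int:
    """Index of the first bucket whose upper edge covers prompt_len."""
    for i, edge in enumerate(edges):
        if prompt_len <= edge:
            return i
    return len(edges)  # overflow bucket for prompts longer than the last edge


def stratified_sample(prompt_lens: list[int], per_bucket: int, seed: int = 0) -> dict[int, list[int]]:
    """Draw up to per_bucket prompts from every length bucket."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    buckets: dict[int, list[int]] = defaultdict(list)
    for n in prompt_lens:
        buckets[bucket_of(n)].append(n)
    return {
        b: rng.sample(items, min(per_bucket, len(items)))
        for b, items in sorted(buckets.items())
    }


def summarize(runs: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(runs), statistics.stdev(runs)


if __name__ == "__main__":
    lens = [10, 100, 200, 300, 1000, 3000, 5000]
    print(stratified_sample(lens, per_bucket=1))

    # Hypothetical SLO attainment from five runs of one configuration.
    mean, std = summarize([0.90, 0.92, 0.91, 0.89, 0.93])
    print(f"{mean:.3f} +/- {std:.3f}")
```

Reporting a per-bucket breakdown alongside the aggregate makes it visible whether a gain holds across short and long prompts or comes mostly from one length range.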

Circularity Check

0 steps flagged

No significant circularity

The paper presents an engineering system (HFX) for LLM serving with a scheduler for proactive budget estimation and a scaler for D2D weight transfer, evaluated via experiments on multi-task workloads. No equations, fitted parameters, predictions, or first-principles derivations are described that could reduce to their own inputs by construction. Performance claims rest on direct empirical comparisons to baselines rather than any self-referential logic, self-citation chains, or ansatz smuggling. The argument is self-contained as a practical systems contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are identifiable from the abstract; the contribution is a systems artifact rather than a formal derivation.

pith-pipeline@v0.9.0 · 5835 in / 1160 out tokens · 47132 ms · 2026-05-18T21:35:11.903499+00:00 · methodology

