HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling
Pith reviewed 2026-05-18 21:35 UTC · model grok-4.3
The pith
HFX jointly optimizes LLM request scheduling and replica scaling to meet diverse SLOs while cutting latency and hardware costs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HFX is a production LLM serving system that jointly optimizes request scheduling and elastic scaling across model replicas to satisfy diverse SLOs. The scheduler performs proactive budget estimation and prioritization to ensure compliance for both new and in-flight requests. The scaler supports fast device-to-device weight transfer to reduce cold-start latency and accommodates both colocated and disaggregated prefill/decode deployments.
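To make "budget estimation and prioritization" concrete, here is a minimal sketch of one common way to rank requests by remaining time budget (least slack first). It is an illustration only, not HFX's algorithm: the `Request` fields, the `slack` definition, and the rule for setting aside infeasible requests are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str             # hypothetical identifier
    deadline_s: float       # absolute time by which the SLO must be met
    est_remaining_s: float  # estimated compute time still needed (assumed given)


def slack(req: Request, now_s: float) -> float:
    """Time budget left after accounting for the estimated remaining work."""
    return req.deadline_s - now_s - req.est_remaining_s


def prioritize(requests: list[Request], now_s: float):
    """Split requests into (feasible, infeasible); feasible is least-slack-first."""
    feasible = [r for r in requests if slack(r, now_s) >= 0]
    infeasible = [r for r in requests if slack(r, now_s) < 0]
    feasible.sort(key=lambda r: slack(r, now_s))
    return feasible, infeasible


# Works the same for new and in-flight requests: only the estimate differs.
queue = [
    Request("a", deadline_s=10.0, est_remaining_s=2.0),
    Request("b", deadline_s=5.0, est_remaining_s=4.0),
    Request("c", deadline_s=3.0, est_remaining_s=4.0),
]
feasible, infeasible = prioritize(queue, now_s=0.0)
```

In this toy queue, request `b` is served before `a` because it has less slack, and `c` is flagged because its estimated work already exceeds its deadline. A real scheduler would also have to decide what to do with such requests; that policy is not described in the text above.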
What carries the argument
The integrated scheduler-scaler that pairs proactive budget estimation with fast D2D weight transfer for SLO-aware multi-replica management.
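A back-of-envelope model helps show why the weight-transfer path matters for cold starts: loading time is roughly model size divided by the bandwidth of the path used. The sketch below uses placeholder sizes and bandwidths chosen for easy arithmetic; none of these numbers come from the paper, and the model ignores setup overhead and contention.

```python
GIB = 1024 ** 3


def transfer_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized transfer time: size / bandwidth, ignoring setup and contention."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return weight_bytes / bandwidth_bytes_per_s


# Hypothetical values for illustration only.
weights = 14 * GIB           # assumed model weight size
host_bandwidth = 2 * GIB     # assumed storage/host loading path, bytes per second
d2d_bandwidth = 28 * GIB     # assumed device-to-device link, bytes per second

host_path_s = transfer_seconds(weights, host_bandwidth)
d2d_path_s = transfer_seconds(weights, d2d_bandwidth)
speedup = host_path_s / d2d_path_s
```

Under these assumed numbers the device-to-device path is faster by the ratio of the two bandwidths. Whether a real deployment sees a comparable gain depends on measured link speeds, which the summary above does not report.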
If this is right
- Up to 4.44 times higher SLO attainment than current systems on multi-task workloads
- Up to 65.82 percent reduction in end-to-end latency
- Up to 49.81 percent lower NPU usage cost
- Effective handling of both colocated and disaggregated prefill/decode configurations
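The headline numbers are ratios, so it helps to spell out how such metrics are usually computed and what they imply. The sketch below assumes SLO attainment means the fraction of requests whose latency is within their own SLO; the paper's exact definition is not given here, and the function names are illustrative.

```python
def slo_attainment(latencies_s: list[float], slos_s: list[float]) -> float:
    """Fraction of requests that finish within their own (per-request) SLO."""
    if not latencies_s or len(latencies_s) != len(slos_s):
        raise ValueError("need equal-length, non-empty inputs")
    met = sum(1 for lat, slo in zip(latencies_s, slos_s) if lat <= slo)
    return met / len(latencies_s)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative saving, e.g. 0.6582 corresponds to a 65.82% reduction."""
    return (baseline - improved) / baseline


def max_baseline_attainment(improvement_factor: float) -> float:
    """Attainment cannot exceed 1.0, so a k-fold gain caps the baseline at 1/k."""
    return 1.0 / improvement_factor


example = slo_attainment([0.5, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 4.0])
ceiling = max_baseline_attainment(4.44)
```

One consequence worth noting: because attainment is bounded by 100%, a 4.44 times improvement can only occur in a setting where the baseline met its SLOs for at most about 22.5% of requests. The "up to" figures therefore describe the most favorable configuration, not a typical one.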
Where Pith is reading between the lines
- The same joint approach might lower over-provisioning in cloud clusters that run many models at once
- Similar scheduler-scaler pairings could be tested on other distributed AI workloads beyond LLMs
- Extending the budget estimation to include network or memory limits might uncover further savings
Load-bearing premise
The multi-task workloads and hardware setups in the tests stand in for real production environments that have mixed requests, changing prompt lengths, and frequent scaling.
What would settle it
A production trace with request patterns or hardware that differ sharply from the test set shows no gain in SLO attainment or cost metrics when HFX is used instead of prior systems.
Original abstract
Large language model (LLM) serving faces the dual challenge of meeting strict user-specific service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Existing approaches either rely on static scheduling policies or focus on single-task settings, limiting their applicability in real-world deployments with heterogeneous requests, variable prompt lengths, and elastic scaling requirements. We present HFX, a production LLM serving system that jointly optimizes request scheduling and elastic scaling across model replicas to satisfy diverse SLOs. HFX introduces a scheduler that performs proactive budget estimation and prioritization to ensure SLO compliance for both new and in-flight requests. HFX also integrates a scaler that supports fast device-to-device (D2D) weight transfer, reducing cold-start latency. Additionally, the system supports both colocated and disaggregated prefill/decode deployments, enabling adaptation to diverse workload patterns and cloud environments. Through extensive experiments on multi-task workloads, we demonstrate consistently higher SLO attainment, lower end-to-end latency, and lower NPU usage cost by up to 4.44×, 65.82%, and 49.81%, respectively, compared to state-of-the-art systems. Our results highlight the effectiveness of SLO-aware scheduling and scaling in practical LLM serving, providing a robust framework for cost-efficient and SLO-compliant deployments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents HFX, a production LLM serving system that jointly optimizes request scheduling and elastic scaling to meet diverse user-specific SLOs under dynamic multi-task workloads. The scheduler performs proactive budget estimation and prioritization for both new and in-flight requests, while the scaler enables fast device-to-device (D2D) weight transfer to reduce cold-start latency. The system supports colocated and disaggregated prefill/decode deployments. Experiments on multi-task workloads claim up to 4.44× higher SLO attainment, 65.82% lower end-to-end latency, and 49.81% lower NPU usage cost relative to state-of-the-art systems.
Significance. If the experimental results hold under representative conditions, the work would be significant for distributed systems and LLM serving research. It offers a practical joint algorithm-system design for handling heterogeneous requests and elastic scaling, addressing real deployment challenges in cost-efficient, SLO-compliant serving. The explicit support for both colocated and disaggregated modes adds flexibility for varied cloud environments.
major comments (1)
- §5 (Experiments): The central claims rest on quantitative experimental results (up to 4.44× SLO attainment, 65.82% latency reduction, 49.81% cost reduction). The abstract and available description provide no details on the specific baselines, workload generation process for multi-task scenarios, hardware configurations, number of runs, or controls for confounds such as prompt length variation. These omissions make it difficult to assess whether the reported gains are load-bearing and reproducible.
minor comments (1)
- Abstract: The inline LaTeX '4.44$\times$' should be rendered as plain text (4.44×) in the final version for readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment on the experimental section below and commit to revisions that improve the clarity and reproducibility of our results.
Referee: §5 (Experiments): The central claims rest on quantitative experimental results (up to 4.44× SLO attainment, 65.82% latency reduction, 49.81% cost reduction). The abstract and available description provide no details on the specific baselines, workload generation process for multi-task scenarios, hardware configurations, number of runs, or controls for confounds such as prompt length variation. These omissions make it difficult to assess whether the reported gains are load-bearing and reproducible.
Authors: We agree that the current presentation of experimental details could be strengthened for better reproducibility. In the revised manuscript, we will expand §5 to include: (1) explicit listing of all baselines with version numbers and configuration parameters; (2) a detailed description of the multi-task workload generator, including how request traces are synthesized from production logs with controlled distributions of prompt lengths, task types, and arrival rates; (3) hardware specifications (NPU models, interconnect, cluster sizes) and software stack; (4) the number of independent runs (we will report results from 5 runs with mean and standard deviation); and (5) explicit controls for prompt length variation, such as stratified sampling across length buckets and per-bucket performance breakdowns. These additions will directly address the concerns while preserving the existing experimental methodology.
Revision: yes
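The reporting promised in items (4) and (5) of the rebuttal can be sketched in a few lines: group requests by prompt-length bucket, compute attainment per bucket, and summarize repeated runs by mean and sample standard deviation. The bucket edges, record layout, and sample values below are hypothetical and are not taken from the paper.

```python
import statistics
from collections import defaultdict


def bucket_of(prompt_len: int, edges: list[int]) -> int:
    """Index of the first bucket whose upper edge is >= prompt_len."""
    for i, edge in enumerate(edges):
        if prompt_len <= edge:
            return i
    return len(edges)  # overflow bucket for prompts longer than every edge


def per_bucket_attainment(records, edges):
    """records: iterable of (prompt_len, latency_s, slo_s) tuples."""
    met = defaultdict(int)
    total = defaultdict(int)
    for prompt_len, latency_s, slo_s in records:
        b = bucket_of(prompt_len, edges)
        total[b] += 1
        met[b] += int(latency_s <= slo_s)
    return {b: met[b] / total[b] for b in sorted(total)}


def summarize_runs(values: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical data: three requests, two bucket edges, five run-level scores.
records = [(100, 0.5, 1.0), (200, 1.5, 1.0), (3000, 2.0, 4.0)]
breakdown = per_bucket_attainment(records, edges=[512, 2048])
mean_att, std_att = summarize_runs([0.80, 0.82, 0.78, 0.81, 0.79])
```

A per-bucket table of this kind would show whether a gain is spread across prompt lengths or concentrated in one bucket, which is the confound the referee raises.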
Circularity Check
No significant circularity
The paper presents an engineering system (HFX) for LLM serving with a scheduler for proactive budget estimation and a scaler for D2D weight transfer, evaluated via experiments on multi-task workloads. No equations, fitted parameters, predictions, or first-principles derivations are described that could reduce to their own inputs by construction. Performance claims rest on direct empirical comparisons to baselines rather than any self-referential logic, self-citation chains, or ansatz smuggling. The argument is self-contained as a practical systems contribution.