HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang, Zhenan Fan

arxiv: 2508.15919 · v3 · submitted 2025-08-21 · cs.DC · cs.AI

Pith reviewed 2026-05-18 21:35 UTC · model grok-4.3

keywords: LLM serving, SLO compliance, elastic scaling, request scheduling, multi-task workloads, NPU efficiency, D2D transfer

The pith

HFX jointly optimizes LLM request scheduling and replica scaling to meet diverse SLOs while cutting latency and hardware costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that existing LLM serving methods fall short when facing mixed requests with different performance targets and changing loads. It introduces a scheduler that estimates resource budgets ahead of time and ranks requests to protect both new arrivals and those already running. A paired scaler moves model weights quickly between devices to start new copies with little delay and works with either combined or separated prefill and decode stages. Experiments on varied workloads report gains in meeting targets, faster overall response times, and reduced processor usage. A reader would care because production systems must balance strict guarantees for users against the expense of running large models at scale.

Core claim

HFX is a production LLM serving system that jointly optimizes request scheduling and elastic scaling across model replicas to satisfy diverse SLOs. The scheduler performs proactive budget estimation and prioritization to ensure compliance for both new and in-flight requests. The scaler supports fast device-to-device weight transfer to reduce cold-start latency and accommodates both colocated and disaggregated prefill/decode deployments.
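
As a rough illustration of what "proactive budget estimation and prioritization" could look like, the sketch below ranks requests by slack: the time left before the SLO deadline minus an estimated remaining service time. It is not the paper's algorithm. The `Request` fields, the linear cost model, and the per-token constants are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-token costs in seconds; real values would be profiled.
PREFILL_S_PER_TOKEN = 0.0002
DECODE_S_PER_TOKEN = 0.02


@dataclass
class Request:
    req_id: str
    arrival_s: float              # arrival time
    slo_s: float                  # end-to-end latency target
    prompt_tokens: int
    expected_output_tokens: int
    generated_tokens: int = 0     # > 0 marks an in-flight request


def remaining_service_s(r: Request) -> float:
    """Estimate remaining work with a simple linear cost model (an assumption)."""
    prefill = 0.0 if r.generated_tokens > 0 else r.prompt_tokens * PREFILL_S_PER_TOKEN
    decode_left = max(r.expected_output_tokens - r.generated_tokens, 0)
    return prefill + decode_left * DECODE_S_PER_TOKEN


def slack_s(r: Request, now_s: float) -> float:
    """Budget left after the estimated remaining work; negative means at risk."""
    deadline = r.arrival_s + r.slo_s
    return deadline - now_s - remaining_service_s(r)


def prioritize(requests: list[Request], now_s: float) -> list[Request]:
    """Least slack first, so new and in-flight requests compete on one scale."""
    return sorted(requests, key=lambda r: slack_s(r, now_s))
```

Ranking new and in-flight requests on the same slack scale is one simple way to protect both groups; a production scheduler would also need admission control and batching constraints, which this sketch leaves out.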

What carries the argument

The integrated scheduler-scaler that pairs proactive budget estimation with fast D2D weight transfer for SLO-aware multi-replica management.

If this is right

Up to 4.44 times higher SLO attainment than current systems on multi-task workloads
Up to 65.82 percent reduction in end-to-end latency
Up to 49.81 percent lower NPU usage cost
Effective handling of both colocated and disaggregated prefill/decode configurations
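
The headline figures mix two kinds of comparison: a ratio (4.44 times) and relative reductions (65.82 and 49.81 percent). The sketch below shows the arithmetic behind each kind, assuming SLO attainment means the fraction of requests that finish within their own target. The function names and all input numbers are hypothetical and are not taken from the paper.

```python
def slo_attainment(latencies_s, slo_targets_s):
    """Fraction of requests whose latency is within its own SLO target."""
    met = sum(1 for lat, slo in zip(latencies_s, slo_targets_s) if lat <= slo)
    return met / len(latencies_s)


def improvement_factor(new, baseline):
    """Ratio form, as in 'N times higher SLO attainment'."""
    return new / baseline


def relative_reduction_pct(new, baseline):
    """Percentage form, as in 'P percent lower latency'."""
    return 100.0 * (baseline - new) / baseline


# Hypothetical measurements chosen only to show the arithmetic.
targets = [1.0, 2.0, 1.0, 2.0]
baseline_attainment = slo_attainment([1.2, 3.0, 0.8, 2.5], targets)  # 1 of 4 met
new_attainment = slo_attainment([0.9, 1.8, 0.8, 2.5], targets)       # 3 of 4 met

print(improvement_factor(new_attainment, baseline_attainment))
print(relative_reduction_pct(new=34.18, baseline=100.0))
```

Because each figure is an "up to" value, it describes the best case among the compared configurations rather than an average gain.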

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same joint approach might lower over-provisioning in cloud clusters that run many models at once
Similar scheduler-scaler pairings could be tested on other distributed AI workloads beyond LLMs
Extending the budget estimation to include network or memory limits might uncover further savings

Load-bearing premise

The multi-task workloads and hardware setups in the tests stand in for real production environments that have mixed requests, changing prompt lengths, and frequent scaling.

What would settle it

A production trace with request patterns or hardware that differ sharply from the test set shows no gain in SLO attainment or cost metrics when HFX is used instead of prior systems.

Figures

Figures reproduced from arXiv: 2508.15919 by Hong Chang, Jiannan Wang, Kan Chen, Morgan Lindsay Heisler, Niloofar Gholipour, Parham Yassini, Qiantao Zhang, Qian Wang, Taha Shabani, Xiaolong Bai, Xinglu Wang, Ying Xiong, Yong Zhang, Zahra Yousefijamarani, Zhenan Fan.

**Figure 3.** Multi-task performance on 2-task and 4-task workloads. Metrics include SLO attainment (top, higher is better), end-to-end latency (middle, lower is better), and cost (bottom, lower is better) for HyperFlexis, RR, SCORPIO, and HyperFlexis-Scaling. Results use two workers, with up to four for scaling.

Original abstract

Large language model (LLM) serving faces the dual challenge of meeting strict user-specific service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Existing approaches either rely on static scheduling policies or focus on single-task settings, limiting their applicability in real-world deployments with heterogeneous requests, variable prompt lengths, and elastic scaling requirements. We present HFX, a production LLM serving system that jointly optimizes request scheduling and elastic scaling across model replicas to satisfy diverse SLOs. HFX introduces a \textbf{scheduler} that performs proactive budget estimation and prioritization to ensure SLO compliance for both new and in-flight requests. HFX also integrates a \textbf{scaler} that supports fast device-to-device (D2D) weight transfer, reducing cold-start latency. Additionally, the system supports both colocated and disaggregated prefill/decode deployments, enabling adaptation to diverse workload patterns and cloud environments. Through extensive experiments on multi-task workloads, we demonstrate consistently higher SLO attainment, lower end-to-end latency, and lower NPU usage cost by up to 4.44$\times$, 65.82\%, and 49.81\%, respectively, compared to state-of-the-art systems. Our results highlight the effectiveness of SLO-aware scheduling and scaling in practical LLM serving, providing a robust framework for cost-efficient and SLO-compliant deployments.
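
To see why device-to-device weight transfer can shorten cold starts, a back-of-the-envelope estimate helps: loading time is bounded below by weight size divided by sustained bandwidth. The sketch below applies that bound to two paths. The model size and both bandwidth figures are hypothetical placeholders, not measurements from the paper or from any specific hardware.

```python
def transfer_time_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on weight-loading time: size divided by sustained bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return weight_bytes / bandwidth_bytes_per_s


GB = 1e9
weights = 14 * GB        # hypothetical: about 7B parameters at 2 bytes each
storage_bw = 2 * GB      # hypothetical storage-to-device path, bytes per second
d2d_bw = 50 * GB         # hypothetical device-to-device interconnect, bytes per second

from_storage = transfer_time_s(weights, storage_bw)
from_peer = transfer_time_s(weights, d2d_bw)
print(f"from storage: {from_storage:.1f} s, device-to-device: {from_peer:.2f} s")
```

The bound ignores process start-up, memory allocation, and warm-up, so real cold-start latency would be higher on both paths.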

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HFX combines proactive budget-based scheduling with D2D fast scaling in one system for multi-SLO LLM serving and shows measurable gains on mixed workloads.

HFX is a serving system that tries to meet different user SLOs at once while cutting cost through joint scheduling and scaling. The scheduler estimates budgets ahead of time and prioritizes requests, and the scaler moves weights quickly between devices instead of reloading from scratch. It also lets operators choose colocated or disaggregated prefill/decode layouts depending on the workload mix. That combination is the concrete contribution here, and it is presented as a production-ready design rather than a new theoretical framework.

The experiments report clear wins on multi-task traces: up to 4.44 times higher SLO attainment, up to 65.82 percent lower end-to-end latency, and up to 49.81 percent lower NPU cost versus the baselines they compare against. Those numbers are the kind of practical signal that operators actually track. The mechanisms themselves are described with enough detail to see how the pieces fit together, and there are no obvious internal contradictions in the algorithm sketches or the evaluation setup.

The main soft spot is the usual one for systems papers: how representative the test workloads and hardware configurations really are. The paper treats the results as evidence for practical settings rather than universal claims, which keeps the assumption from becoming load-bearing. Still, anyone reproducing the work would want the exact workload generator and baseline configurations spelled out more explicitly.

This is the kind of paper that matters to teams running LLM inference at scale who already deal with heterogeneous requests and elastic demand. A reader who needs ideas for production schedulers or scaling logic will find usable design points even if they end up changing some of the details. It is solid enough to deserve a serious referee who can press on the experimental controls and workload fidelity.

Referee Report

1 major / 1 minor

Summary. The manuscript presents HFX, a production LLM serving system that jointly optimizes request scheduling and elastic scaling to meet diverse user-specific SLOs under dynamic multi-task workloads. The scheduler performs proactive budget estimation and prioritization for both new and in-flight requests, while the scaler enables fast device-to-device (D2D) weight transfer to reduce cold-start latency. The system supports colocated and disaggregated prefill/decode deployments. Experiments on multi-task workloads claim up to 4.44× higher SLO attainment, 65.82% lower end-to-end latency, and 49.81% lower NPU usage cost relative to state-of-the-art systems.

Significance. If the experimental results hold under representative conditions, the work would be significant for distributed systems and LLM serving research. It offers a practical joint algorithm-system design for handling heterogeneous requests and elastic scaling, addressing real deployment challenges in cost-efficient, SLO-compliant serving. The explicit support for both colocated and disaggregated modes adds flexibility for varied cloud environments.

major comments (1)

[§5 (Experiments)] The central claims rest on quantitative experimental results (up to 4.44× SLO attainment, 65.82% latency reduction, 49.81% cost reduction). The abstract and available description provide no details on the specific baselines, workload generation process for multi-task scenarios, hardware configurations, number of runs, or controls for confounds such as prompt length variation. These omissions make it difficult to assess whether the reported gains are load-bearing and reproducible.

minor comments (1)

[Abstract] The inline LaTeX '4.44$\times$' should be rendered as plain text (4.44×) in the final version for readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment on the experimental section below and commit to revisions that improve the clarity and reproducibility of our results.

Referee: §5 (Experiments): The central claims rest on quantitative experimental results (up to 4.44× SLO attainment, 65.82% latency reduction, 49.81% cost reduction). The abstract and available description provide no details on the specific baselines, workload generation process for multi-task scenarios, hardware configurations, number of runs, or controls for confounds such as prompt length variation. These omissions make it difficult to assess whether the reported gains are load-bearing and reproducible.

Authors: We agree that the current presentation of experimental details could be strengthened for better reproducibility. In the revised manuscript, we will expand §5 to include: (1) explicit listing of all baselines with version numbers and configuration parameters; (2) a detailed description of the multi-task workload generator, including how request traces are synthesized from production logs with controlled distributions of prompt lengths, task types, and arrival rates; (3) hardware specifications (NPU models, interconnect, cluster sizes) and software stack; (4) the number of independent runs (we will report results from 5 runs with mean and standard deviation); and (5) explicit controls for prompt length variation, such as stratified sampling across length buckets and per-bucket performance breakdowns. These additions will directly address the concerns while preserving the existing experimental methodology.

Revision: yes
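
Two of the promised controls, stratified sampling across prompt-length buckets and reporting mean and standard deviation over repeated runs, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the bucket edges, the sample values, and the helper names are hypothetical, and nothing here reflects the authors' actual workload generator.

```python
import random
import statistics
from collections import defaultdict

# Hypothetical bucket edges in tokens: [0, 512), [512, 2048), [2048, inf)
BUCKET_EDGES = [512, 2048]


def bucket_of(prompt_len: int) -> int:
    """Index of the length bucket a prompt falls into."""
    for i, edge in enumerate(BUCKET_EDGES):
        if prompt_len < edge:
            return i
    return len(BUCKET_EDGES)


def stratified_sample(prompt_lens, per_bucket, seed=0):
    """Draw up to `per_bucket` prompts from every length bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for n in prompt_lens:
        buckets[bucket_of(n)].append(n)
    return {b: rng.sample(items, min(per_bucket, len(items)))
            for b, items in sorted(buckets.items())}


def summarize(runs):
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(runs), statistics.stdev(runs)


# Hypothetical SLO-attainment values from five independent runs.
mean, std = summarize([0.91, 0.93, 0.90, 0.92, 0.94])
print(f"SLO attainment: {mean:.3f} +/- {std:.3f}")
```

Sampling the same number of prompts per bucket keeps a few very long prompts from dominating the comparison, and per-bucket results can then be reported alongside the aggregate.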

Circularity Check

0 steps flagged

No significant circularity

The paper presents an engineering system (HFX) for LLM serving with a scheduler for proactive budget estimation and a scaler for D2D weight transfer, evaluated via experiments on multi-task workloads. No equations, fitted parameters, predictions, or first-principles derivations are described that could reduce to their own inputs by construction. Performance claims rest on direct empirical comparisons to baselines rather than any self-referential logic, self-citation chains, or ansatz smuggling. The argument is self-contained as a practical systems contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are identifiable from the abstract; the contribution is a systems artifact rather than a formal derivation.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 12 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Taming throughput-latency tradeoff in llm inference with sarathi-serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in llm inference with sarathi-serve. Proceedings of 18th USENIX Symposium on Operating Systems Design and Implementation, 2024, Santa Clara, 2024

work page 2024
[3]

Friedman, Thomas Williams, Ramesh K

Sohaib Ahmad, Hui Guan, Brian D. Friedman, Thomas Williams, Ramesh K. Sitara- man, and Thomas Woo. Proteus: A high-throughput inference-serving system with accu- racy scaling. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, V olume1, ASPLOS ’24, page 318–334, New York, NY , ...

work page 2024
[4]

Ascend pytorch adapter ( torch_npu)

Ascend Community. Ascend pytorch adapter ( torch_npu). https://github.com/A scend/pytorch. Accessed: 2025-08-18

work page 2025
[5]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

LLMclean: Context-aware tabular data cleaning via LLM-generated OFDs

Fabian Biester, Mohamed Abdelaal, and Daniel Del Gaudio. LLMclean: Context-aware tabular data cleaning via LLM-generated OFDs. In European Conference on Advances in Databases and Information Systems, pages 68–78, Cham, Switzerland, 2024. Springer

work page 2024
[7]

Slos-serve: Optimized serving of multi-slo llms

Siyuan Chen, Zhipeng Jia, Samira Khan, Arvind Krishnamurthy, and Phillip B Gibbons. Slos-serve: Optimized serving of multi-slo llms. arXiv preprint arXiv:2504.08784, 2025

work page arXiv 2025
[8]

Siyuan Chen, Zhipeng Jia, Samira Khan, Arvind Krishnamurthy, and Phillip B. Gibbons. Slos-serve: Optimized serving of multi-slo llms, 2025

work page 2025
[9]

SCOOT: SLO- oriented performance tuning for LLM inference engines

Ke Cheng, Zhi Wang, Wen Hu, Tiannuo Yang, Jianguo Li, and Sheng Zhang. SCOOT: SLO- oriented performance tuning for LLM inference engines. In THE WEB CONFERENCE 2025, 2025

work page 2025
[10]

Elis: Efficient llm iterative scheduling system with response length predictor, 2025

Seungbeom Choi, Jeonghoe Goo, Eunjoo Jeon, Mingyu Yang, and Minsung Jang. Elis: Efficient llm iterative scheduling system with response length predictor, 2025

work page 2025
[11]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

Cmekg_tools: Chinese medical knowledge graph tools

CMeKG_tools Contributors. Cmekg_tools: Chinese medical knowledge graph tools. https://github.com/king-yyf/CMeKG_tools, 2023

work page 2023
[13]

Optimizing slo-oriented llm serving with pd-multiplexing, 2025

Weihao Cui, Yukang Chen, Han Zhao, Ziyi Xu, Quan Chen, Xusheng Chen, Yangjie Zhou, Shixuan Sun, and Minyi Guo. Optimizing slo-oriented llm serving with pd-multiplexing, 2025. 24

work page 2025
[14]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Apt-serve: Adaptive request scheduling on hybrid cache for scalable llm inference serving

Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. Apt-serve: Adaptive request scheduling on hybrid cache for scalable llm inference serving. Proc. ACM Manag. Data, 3(3), June 2025

work page 2025
[16]

Has-gpu: Efficient hybrid auto-scaling with fine-grained gpu allocation for slo-aware serverless inferences, 2025

Jianfeng Gu, Puxuan Wang, Isaac Nunezand, Kai Huang, and Michael Gerndt. Has-gpu: Efficient hybrid auto-scaling with fine-grained gpu allocation for slo-aware serverless inferences, 2025

work page 2025
[17]

Inference without interference: Disaggregate llm inference for mixed downstream workloads, 2024

Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, and Yizhou Shan. Inference without interference: Disaggregate llm inference for mixed downstream workloads, 2024

work page 2024
[18]

Slo-aware scheduling for large language model inferences, 2025

Jinqi Huang, Yi Xiong, Xuebing Yu, Wenjie Huang, Entong Li, Li Zeng, and Xin Chen. Slo-aware scheduling for large language model inferences, 2025

work page 2025
[19]

Ascend community hardware documentation

Huawei Technologies Co., Ltd. Ascend community hardware documentation. https: //www.hiascend.com/en/document?tag=hardware, 2025

work page 2025
[20]

A Survey on Large Language Models for Code Generation

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023
[22]

Llm inference serving: Survey of recent advances and opportunities

Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari. Llm inference serving: Survey of recent advances and opportunities. In 2024 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–8, 2024. 25

work page 2024
[23]

John, and Neeraja J

Ruihao Li, Shagnik Pal, Vineeth Narayan Pullu, Prasoon Sinha, Jeeho Ryoo, Lizy K. John, and Neeraja J. Yadwadkar. Mirage: Kv cache optimization through parameter remapping for multi-tenant llm serving, 2025

work page 2025
[24]

Ascend AI Processor Architecture and Programming: Principles and Applications of CANN

Xiaoyao Liang. Ascend AI Processor Architecture and Programming: Principles and Applications of CANN. Elsevier, 2020

work page 2020
[25]

Ascend: a scalable and unified architecture for ubiquitous deep neural network computing : Industry track paper

Heng Liao, Jiajin Tu, Jing Xia, Hu Liu, Xiping Zhou, Honghui Yuan, and Yuxing Hu. Ascend: a scalable and unified architecture for ubiquitous deep neural network computing : Industry track paper. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 789–801, 2021

work page 2021
[26]

Ascend: a scalable and unified architecture for ubiquitous deep neural network comput- ing: Industry track paper

Heng Liao, Jiajin Tu, Jing Xia, Hu Liu, Xiping Zhou, Honghui Yuan, and Yuxing Hu. Ascend: a scalable and unified architecture for ubiquitous deep neural network comput- ing: Industry track paper. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 789–801. IEEE, 2021

work page 2021
[27]

Davinci: A scalable architecture for neural network computing

Heng Liao, Jiajin Tu, Jing Xia, and Xiping Zhou. Davinci: A scalable architecture for neural network computing. In 2019 IEEE Hot Chips 31 Symposium (HCS), pages 1–44. IEEE Computer Society, 2019

work page 2019
[28]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Spotserve: Serving generative large language models on preemptible in- stances

Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhi- hao Jia. Spotserve: Serving generative large language models on preemptible in- stances. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, V olume2, ASPLOS ’24, page 1112–1127, New York, NY , USA, ...

work page 2024
[30]

Aladdin: Joint placement and scaling for slo-aware llm serving

Chengyi Nie, Rodrigo Fonseca, and Zhenhua Liu. Aladdin: Joint placement and scaling for slo-aware llm serving. arXiv preprint arXiv:2405.06856, 2024

work page arXiv 2024
[31]

Splitwise: Efficient generative llm inference using phase split- ting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase split- ting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132, 2024

work page 2024
[32]

Queue management for slo-oriented large language model serving

Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. Queue management for slo-oriented large language model serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing, SoCC ’24, page 18–35, New York, NY , USA, 2024. Association for Computing Machinery

work page 2024
[33]

Mooncake: A kvcache-centric disaggregated architecture for llm serving, 2024

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: A kvcache-centric disaggregated architecture for llm serving. arXiv preprint arXiv:2407.00079, 2024. 26

work page arXiv 2024
[34]

Dynaserve: Unified and elastic execution for dynamic disaggregated llm serving, 2025

Chaoyi Ruan, Yinhe Chen, Dongqi Tian, Yandong Shi, Yongji Wu, Jialin Li, and Cheng Li. Dynaserve: Unified and elastic execution for dynamic disaggregated llm serving, 2025

work page 2025
[35]

S-lora: Serving thousands of concurrent lora adapters.arXiv preprint arXiv:2311.03285, 2023

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christo- pher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, et al. S-lora: Serving thousands of concurrent lora adapters. arXiv preprint arXiv:2311.03285, 2023

work page arXiv 2023
[36]

Flexgen: High-throughput generative inference of large language models with a single gpu

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023

work page 2023
[37]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[38]

USHER: Holistic interference avoid- ance for resource optimized ML inference

Sudipta Saha Shubha, Haiying Shen, and Anand Iyer. USHER: Holistic interference avoid- ance for resource optimized ML inference. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 947–964, Santa Clara, CA, July

work page
[39]

Llumnix: Dynamic scheduling for large language model serving

Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 173–191, Santa Clara, CA, July 2024. USENIX Association

work page 2024
[40]

Scorpio: Serving the right requests at the right time for heterogeneous slos in llm inference, 2025

Yinghao Tang, Tingfeng Lan, Xiuqi Huang, Hui Lu, and Wei Chen. Scorpio: Serving the right requests at the right time for heterogeneous slos in llm inference, 2025

work page 2025
[41]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupati- raju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Clinical text summarization: adapting large language models can outper- form human experts

Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna See- hofnerova, et al. Clinical text summarization: adapting large language models can outper- form human experts. Research Square, pages rs–3, 2023

work page 2023
[44]

vllm ascend plugin

vLLM Community. vllm ascend plugin. https://github.com/vllm-project/v llm-ascend, 2024. Accessed: 2025-06-05

work page 2024
[45]

Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism

Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems 27 Principles, SOSP ’24, page 640–654, New York, NY , USA, 2024. Association for Comput- ing Machinery

work page 2024
[46]

Unlock the potential of fine-grained llm serving via dynamic module scaling, 2025

Jingfeng Wu, Yiyuan He, Minxian Xu, Xitong Gao, Kejiang Ye, and Chengzhong Xu. Unlock the potential of fine-grained llm serving via dynamic module scaling, 2025

work page 2025
[47]

Arrow: Adaptive scheduling mechanisms for disaggregated llm inference architecture, 2025

Yu Wu, Tongxuan Liu, Yuting Zeng, Siyu Wu, Jun Xiong, Xianzhe Dong, Hailong Yang, Ke Zhang, and Jing Li. Arrow: Adaptive scheduling mechanisms for disaggregated llm inference architecture, 2025

work page 2025
[48]

Less is more for long document summary evaluation by LLMs.arXiv preprint arXiv:2309.07382, 2023

Yunshu Wu, Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, and Estevam Hruschka. Less is more for long document summary evaluation by LLMs.arXiv preprint arXiv:2309.07382, 2023

work page arXiv 2023
[49]

Kunpeng 920: The first 7-nm chiplet-based 64-core arm soc for cloud services

Jing Xia, Chuanning Cheng, Xiping Zhou, Yuxing Hu, and Peter Chun. Kunpeng 920: The first 7-nm chiplet-based 64-core arm soc for cloud services. IEEE Micro, 41(5):67–75, 2021

work page 2021
[50]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

{BlitzScale}: Fast and live large model autoscaling with o (1) host caching

Dingyan Zhang, Haotian Wang, Yang Liu, Xingda Wei, Yizhou Shan, Rong Chen, and Haibo Chen. {BlitzScale}: Fast and live large model autoscaling with o (1) host caching. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 275–293, 2025

work page 2025
[52]

Tempo: Application-aware llm serving with mixed slo requirements, 2025

Wei Zhang, Zhiyu Wu, Yi Mu, Banruo Liu, Myungjin Lee, and Fan Lai. Tempo: Application-aware llm serving with mixed slo requirements, 2025

work page 2025
[53]

Tempo: Application-aware llm serving with mixed slo requirements.arXiv preprint arXiv:2504.20068,

Wei Zhang, Zhiyu Wu, Yi Mu, Banruo Liu, Myungjin Lee, and Fan Lai. Tempo: Application-aware llm serving with mixed slo requirements. arXiv preprint arXiv:2504.20068, 2025

work page arXiv 2025
[54]

Lora land: 310 fine-tuned llms that rival gpt-4 — a technical report

Justin Zhao, Timothy Wang, Wael Abid, Geoffrey Angus, Arnav Garg, Jeffery Kinnison, Piero Molino, Travis Addair, and Devvret Rishi. Lora land: 310 fine-tuned llms that rival gpt-4 — a technical report. https://predibase.com/blog/lora-land-fine-t uned-open-source-llms-that-outperform-gpt-4, 2024

work page 2024
[55]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning

Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017. 28

work page internal anchor Pith review Pith/arXiv arXiv 2017
[57]

Distserve: disaggregating prefill and decoding for goodput-optimized large language model serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, OSDI’24, USA, 2024. USENIX Association

work page 2024
[58]

Squeezing operator performance potential for the ascend architecture

Yuhang Zhou, Zhibin Wang, Guyue Liu, Shipeng Li, Xi Lin, Zibo Wang, Yongzhong Wang, Fuchun Wei, Jingyi Zhang, Zhiheng Hu, et al. Squeezing operator performance potential for the ascend architecture. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, V olume2, pages 1156–1171, 2025

work page 2025
[59]

Polyserve: Efficient multi-slo serving at scale, 2025

Kan Zhu, Haiyang Shi, Le Xu, Jiaxin Shan, Arvind Krishnamurthy, Baris Kasikci, and Liguang Xie. Polyserve: Efficient multi-slo serving at scale, 2025

[60]

Serving large language models on huawei cloudmatrix384, 2025

Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, Shirui Lu, Zhao Qiu, Peiyang Li, Xianyu Chang, Zhengzhong Yu, Fangzheng Miao, Jia Zheng, Ying Li, Yuan Feng, Bei Wang, Zaijian Zong, Mosong Zhou, Wenli Zhou, Houjiang Chen, Xingyu Liao, Yipeng Li, Wenxiao Zhang, Ping Zhu, Yinggang Wang, Chuanjie Xiao...

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[2]

Taming throughput-latency tradeoff in llm inference with sarathi-serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in llm inference with sarathi-serve. Proceedings of 18th USENIX Symposium on Operating Systems Design and Implementation, 2024, Santa Clara, 2024

[3]

Proteus: A high-throughput inference-serving system with accuracy scaling

Sohaib Ahmad, Hui Guan, Brian D. Friedman, Thomas Williams, Ramesh K. Sitaraman, and Thomas Woo. Proteus: A high-throughput inference-serving system with accuracy scaling. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ASPLOS ’24, page 318–334, New York, NY, ...

[4]

Ascend pytorch adapter (torch_npu)

Ascend Community. Ascend pytorch adapter (torch_npu). https://github.com/Ascend/pytorch. Accessed: 2025-08-18

[5]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

[6]

LLMclean: Context-aware tabular data cleaning via LLM-generated OFDs

Fabian Biester, Mohamed Abdelaal, and Daniel Del Gaudio. LLMclean: Context-aware tabular data cleaning via LLM-generated OFDs. In European Conference on Advances in Databases and Information Systems, pages 68–78, Cham, Switzerland, 2024. Springer

[7]

Slos-serve: Optimized serving of multi-slo llms

Siyuan Chen, Zhipeng Jia, Samira Khan, Arvind Krishnamurthy, and Phillip B Gibbons. Slos-serve: Optimized serving of multi-slo llms. arXiv preprint arXiv:2504.08784, 2025

[8]

Siyuan Chen, Zhipeng Jia, Samira Khan, Arvind Krishnamurthy, and Phillip B. Gibbons. Slos-serve: Optimized serving of multi-slo llms, 2025

[9]

SCOOT: SLO-oriented performance tuning for LLM inference engines

Ke Cheng, Zhi Wang, Wen Hu, Tiannuo Yang, Jianguo Li, and Sheng Zhang. SCOOT: SLO-oriented performance tuning for LLM inference engines. In THE WEB CONFERENCE 2025, 2025

[10]

Elis: Efficient llm iterative scheduling system with response length predictor, 2025

Seungbeom Choi, Jeonghoe Goo, Eunjoo Jeon, Mingyu Yang, and Minsung Jang. Elis: Efficient llm iterative scheduling system with response length predictor, 2025

[11]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

[12]

Cmekg_tools: Chinese medical knowledge graph tools

CMeKG_tools Contributors. Cmekg_tools: Chinese medical knowledge graph tools. https://github.com/king-yyf/CMeKG_tools, 2023

[13]

Optimizing slo-oriented llm serving with pd-multiplexing, 2025

Weihao Cui, Yukang Chen, Han Zhao, Ziyi Xu, Quan Chen, Xusheng Chen, Yangjie Zhou, Shixuan Sun, and Minyi Guo. Optimizing slo-oriented llm serving with pd-multiplexing, 2025

[14]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...

[15]

Apt-serve: Adaptive request scheduling on hybrid cache for scalable llm inference serving

Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. Apt-serve: Adaptive request scheduling on hybrid cache for scalable llm inference serving. Proc. ACM Manag. Data, 3(3), June 2025

[16]

Has-gpu: Efficient hybrid auto-scaling with fine-grained gpu allocation for slo-aware serverless inferences, 2025

Jianfeng Gu, Puxuan Wang, Isaac Nunezand, Kai Huang, and Michael Gerndt. Has-gpu: Efficient hybrid auto-scaling with fine-grained gpu allocation for slo-aware serverless inferences, 2025

[17]

Inference without interference: Disaggregate llm inference for mixed downstream workloads, 2024

Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, and Yizhou Shan. Inference without interference: Disaggregate llm inference for mixed downstream workloads, 2024

[18]

Slo-aware scheduling for large language model inferences, 2025

Jinqi Huang, Yi Xiong, Xuebing Yu, Wenjie Huang, Entong Li, Li Zeng, and Xin Chen. Slo-aware scheduling for large language model inferences, 2025

[19]

Ascend community hardware documentation

Huawei Technologies Co., Ltd. Ascend community hardware documentation. https://www.hiascend.com/en/document?tag=hardware, 2025

[20]

A Survey on Large Language Models for Code Generation

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515, 2024

[21]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[22]

Llm inference serving: Survey of recent advances and opportunities

Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari. Llm inference serving: Survey of recent advances and opportunities. In 2024 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–8, 2024

[23]

Mirage: Kv cache optimization through parameter remapping for multi-tenant llm serving

Ruihao Li, Shagnik Pal, Vineeth Narayan Pullu, Prasoon Sinha, Jeeho Ryoo, Lizy K. John, and Neeraja J. Yadwadkar. Mirage: Kv cache optimization through parameter remapping for multi-tenant llm serving, 2025

[24]

Ascend AI Processor Architecture and Programming: Principles and Applications of CANN

Xiaoyao Liang. Ascend AI Processor Architecture and Programming: Principles and Applications of CANN. Elsevier, 2020

[25]

Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper

Heng Liao, Jiajin Tu, Jing Xia, Hu Liu, Xiping Zhou, Honghui Yuan, and Yuxing Hu. Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 789–801, 2021

[26]

Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper

Heng Liao, Jiajin Tu, Jing Xia, Hu Liu, Xiping Zhou, Honghui Yuan, and Yuxing Hu. Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 789–801. IEEE, 2021

[27]

Davinci: A scalable architecture for neural network computing

Heng Liao, Jiajin Tu, Jing Xia, and Xiping Zhou. Davinci: A scalable architecture for neural network computing. In 2019 IEEE Hot Chips 31 Symposium (HCS), pages 1–44. IEEE Computer Society, 2019

[28]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

[29]

Spotserve: Serving generative large language models on preemptible instances

Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia. Spotserve: Serving generative large language models on preemptible instances. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’24, page 1112–1127, New York, NY, USA, ...

[30]

Aladdin: Joint placement and scaling for slo-aware llm serving

Chengyi Nie, Rodrigo Fonseca, and Zhenhua Liu. Aladdin: Joint placement and scaling for slo-aware llm serving. arXiv preprint arXiv:2405.06856, 2024

[31]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132, 2024

[32]

Queue management for slo-oriented large language model serving

Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. Queue management for slo-oriented large language model serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing, SoCC ’24, page 18–35, New York, NY, USA, 2024. Association for Computing Machinery

[33]

Mooncake: A kvcache-centric disaggregated architecture for llm serving, 2024

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: A kvcache-centric disaggregated architecture for llm serving. arXiv preprint arXiv:2407.00079, 2024

[34]

Dynaserve: Unified and elastic execution for dynamic disaggregated llm serving, 2025

Chaoyi Ruan, Yinhe Chen, Dongqi Tian, Yandong Shi, Yongji Wu, Jialin Li, and Cheng Li. Dynaserve: Unified and elastic execution for dynamic disaggregated llm serving, 2025

[35]

S-lora: Serving thousands of concurrent lora adapters

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, et al. S-lora: Serving thousands of concurrent lora adapters. arXiv preprint arXiv:2311.03285, 2023

[36]

Flexgen: High-throughput generative inference of large language models with a single gpu

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023

[37]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

[38]

USHER: Holistic interference avoidance for resource optimized ML inference

Sudipta Saha Shubha, Haiying Shen, and Anand Iyer. USHER: Holistic interference avoidance for resource optimized ML inference. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 947–964, Santa Clara, CA, July

[39]

Llumnix: Dynamic scheduling for large language model serving

Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 173–191, Santa Clara, CA, July 2024. USENIX Association

[40]

Scorpio: Serving the right requests at the right time for heterogeneous slos in llm inference, 2025

Yinghao Tang, Tingfeng Lan, Xiuqi Huang, Hui Lu, and Wei Chen. Scorpio: Serving the right requests at the right time for heterogeneous slos in llm inference, 2025

[41]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

[42]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[43]

Clinical text summarization: adapting large language models can outperform human experts

Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerova, et al. Clinical text summarization: adapting large language models can outperform human experts. Research Square, pages rs–3, 2023

[44]

vllm ascend plugin

vLLM Community. vllm ascend plugin. https://github.com/vllm-project/vllm-ascend, 2024. Accessed: 2025-06-05

[45]

Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism

Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, SOSP ’24, page 640–654, New York, NY, USA, 2024. Association for Computing Machinery

[46]

Unlock the potential of fine-grained llm serving via dynamic module scaling, 2025

Jingfeng Wu, Yiyuan He, Minxian Xu, Xitong Gao, Kejiang Ye, and Chengzhong Xu. Unlock the potential of fine-grained llm serving via dynamic module scaling, 2025

[47]

Arrow: Adaptive scheduling mechanisms for disaggregated llm inference architecture, 2025

Yu Wu, Tongxuan Liu, Yuting Zeng, Siyu Wu, Jun Xiong, Xianzhe Dong, Hailong Yang, Ke Zhang, and Jing Li. Arrow: Adaptive scheduling mechanisms for disaggregated llm inference architecture, 2025

[48]

Less is more for long document summary evaluation by LLMs

Yunshu Wu, Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, and Estevam Hruschka. Less is more for long document summary evaluation by LLMs. arXiv preprint arXiv:2309.07382, 2023

[49]

Kunpeng 920: The first 7-nm chiplet-based 64-core arm soc for cloud services

Jing Xia, Chuanning Cheng, Xiping Zhou, Yuxing Hu, and Peter Chun. Kunpeng 920: The first 7-nm chiplet-based 64-core arm soc for cloud services. IEEE Micro, 41(5):67–75, 2021

[50]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

[51]

BlitzScale: Fast and live large model autoscaling with O(1) host caching

Dingyan Zhang, Haotian Wang, Yang Liu, Xingda Wei, Yizhou Shan, Rong Chen, and Haibo Chen. BlitzScale: Fast and live large model autoscaling with O(1) host caching. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 275–293, 2025

[52]

Tempo: Application-aware llm serving with mixed slo requirements, 2025

Wei Zhang, Zhiyu Wu, Yi Mu, Banruo Liu, Myungjin Lee, and Fan Lai. Tempo: Application-aware llm serving with mixed slo requirements, 2025

[53]

Tempo: Application-aware llm serving with mixed slo requirements

Wei Zhang, Zhiyu Wu, Yi Mu, Banruo Liu, Myungjin Lee, and Fan Lai. Tempo: Application-aware llm serving with mixed slo requirements. arXiv preprint arXiv:2504.20068, 2025
