Compass: SLO-aware Query Planner for Compound AI Serving at Scale

Banruo Liu; Fan Lai; Minghao Fang; Wei-Yu Lin; Yihan Jiang

arxiv: 2504.16397 · v2 · pith:KBKUF5NMnew · submitted 2025-04-23 · 💻 cs.DB · cs.LG


Pith reviewed 2026-05-22 19:09 UTC · model grok-4.3

classification 💻 cs.DB cs.LG

keywords compound AI serving · SLO-aware planning · query planner · service level objectives · plan decomposition · bipartite matching · resource allocation · goodput optimization


The pith

Compass decomposes multi-SLO planning for compound AI systems into tractable subproblems while preserving global decision quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Compound AI applications run pipelines of operators across cloud and edge to meet latency, accuracy, and cost goals for many users at once. The space of possible placements, configurations, and resource choices explodes when queries compete for shared hardware and each has its own SLO targets. Compass tackles the explosion by breaking the overall planning task into smaller subproblems that reuse similar plans across queries, adding selective profiling for accurate estimates without full measurement, and using bipartite matching at runtime to assign plans under contention. If this works, operators can run responsive, cost-effective compound AI services at scale instead of over-provisioning or accepting missed targets.

Core claim

Compass is the first SLO-aware query planner that optimizes large-scale compound AI workloads across diverse deployment spaces. It decomposes the many-query, multi-SLO planning problem into tractable subproblems while preserving global decision quality, exploits plan similarities within and across queries to reduce search steps, improves per-step efficiency with selective profiling for high-fidelity estimates, and applies query-plan bipartite matching at runtime to maximize SLO goodput under resource contentions.

What carries the argument

Decomposition of the many-query multi-SLO planning problem into subproblems that exploit plan similarities, paired with selective plan profiling and runtime bipartite matching.
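To make the runtime matching step concrete, here is a minimal sketch of query-plan bipartite matching. It assumes an unweighted formulation in which each plan slot serves at most one query and uses the standard augmenting-path method for maximum matching; the paper's actual matching objective is not described on this page, and the names `match_queries_to_plans`, `feasible`, and the slot labels are hypothetical.

```python
from typing import Dict, List, Set


def match_queries_to_plans(feasible: Dict[str, List[str]]) -> Dict[str, str]:
    """Maximum-cardinality bipartite matching via augmenting paths.

    feasible maps each query id to the plan slots that would satisfy its SLOs.
    Assumption: each slot is one unit of capacity and serves at most one query.
    Returns a query -> slot assignment covering as many queries as possible.
    """
    slot_owner: Dict[str, str] = {}  # slot -> query currently holding it

    def try_assign(query: str, visited: Set[str]) -> bool:
        for slot in feasible[query]:
            if slot in visited:
                continue
            visited.add(slot)
            # Take a free slot, or move its current owner to another slot.
            if slot not in slot_owner or try_assign(slot_owner[slot], visited):
                slot_owner[slot] = query
                return True
        return False

    for query in feasible:
        try_assign(query, set())

    return {query: slot for slot, query in slot_owner.items()}


# Hypothetical contention: three queries compete for two slots.
feasible = {
    "q1": ["edge-small", "cloud-gpu"],
    "q2": ["edge-small"],
    "q3": ["cloud-gpu"],
}
assignment = match_queries_to_plans(feasible)
print(len(assignment), "of", len(feasible), "queries are matched to a plan")
```

In this toy instance q1 first takes `edge-small`, then gets moved to `cloud-gpu` so that q2 can be served; q3 is left unmatched because only two slots exist.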

Load-bearing premise

Decomposing the many-query multi-SLO planning problem into tractable subproblems while exploiting plan similarities preserves global decision quality.

What would settle it

A deployment trace with heterogeneous device speeds and hundreds of concurrent queries in which Compass's measured goodput stays within 10 percent of a baseline planner's, or in which its planning latency exceeds a few seconds.

Figures

Figures reproduced from arXiv: 2504.16397 by Banruo Liu, Fan Lai, Minghao Fang, Wei-Yu Lin, Yihan Jiang.

**Figure 1.** As compound AI services increasingly cater to end users, their deployment can span multi-tier, heterogeneous infrastructure.

**Figure 2.** Edge devices exhibit (a) heterogeneous computational and communication speeds and (b) diverse data distributions, leading to accuracy variance for the same plan across users. “Low-high” refers to using resource-light configurations for operator 1 and resource-intensive configurations for operator 2.

**Figure 4.** (a) State-of-the-art query planners suffer from slow response times to find the cost-effective plan, and (b) are insufficient for multi-query global planning.

**Figure 5.** Circinus overview for compound AI serving at scale.

**Figure 6.** Circinus optimizes single-query search by reducing the per-step search cost (§4.1) and the number of search steps (§4.2).

**Figure 8.** The prediction accuracy of f_l, the latency predictor. Left: CMBO accurately predicts latency, while normal BO introduces significant error. Right: Circinus uses a history model to achieve a warm start, and switches back to the CMBO model when it matures.

**Figure 9.** An example of the Circinus API for a visual tracking task.

**Figure 10.** In resource-constrained deployments, Circinus improves service goodput across scales and applications, achieving performance close to the ILP optimal. It outperforms baselines that require 3× longer response times (15 seconds, which renders real-time service impractical).

**Figure 11.** In cost-constrained deployment, Circinus reduces resource monetary costs in supporting all queries.

**Figure 12.** Circinus improves query responsiveness in single-query planning. Error bars show the 25th to 75th percentiles.

**Figure 13.** Circinus reduces planning (profiling) costs for throughput-critical applications, in A100 GPU hours.

**Figure 15.** Circinus’s performance in a wide range of settings.

**Figure 16.** Circinus outperforms baselines across different tiers.

**Figure 17.** Circinus handles runtime dynamics efficiently.

**Figure 18.** Encoding configuration into discrete one-hot features sometimes outperforms encoding configuration into continuous variables for BO models in our task.

**Figure 19.** Speech Recognition (small).

**Figure 20.** Visual Tracking (small).

**Figure 21.** Visual Tracking (large).

Original abstract

The rise of compound AI serving that integrates multiple operators in a pipeline enables end-user applications such as generative AI-powered meeting companions, autonomous driving, and immersive gaming. These workloads span diverse deployment spaces, from cloud-only queries to edge-assisted ones across infrastructure tiers, often including both within an application. Achieving high service goodput -- i.e., meeting service level objectives (SLOs) for pipeline latency, accuracy, and costs -- requires joint planning of operators' placement, configuration, and resource allocation. However, diverse SLOs, varying runtime environments (e.g., heterogeneous device speeds), and a large volume of queries competing for shared infrastructure explode the planning space, making real-time serving and cost-efficient deployment intractable with existing advances. This paper presents Compass, the first SLO-aware query planner that optimizes large-scale compound AI workloads across diverse deployment spaces. Compass decomposes the many-query, multi-SLO planning problem into tractable subproblems while preserving global decision quality, exploiting plan similarities within and across queries to slash the search steps. It further improves per-step efficiency with a plan profiler that performs selective profiling to achieve high-fidelity performance estimates at a fraction of the profiling cost. At runtime, Compass performs query-plan bipartite matching to maximize SLO goodput under resource contentions. Real-world evaluations show that Compass improves service goodput by 2.4--5.1x, reduces deployment costs by 3.8--4.5x, and accelerates planning by 4.2--10.5x, achieving service responsiveness within seconds and near-optimal decision quality.
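The abstract defines service goodput as meeting SLOs for pipeline latency, accuracy, and cost. A minimal sketch of that metric follows, assuming goodput is counted as the number of queries whose measured outcome satisfies all three targets; the class names, field names, and numbers are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SLO:
    max_latency_s: float   # latency must not exceed this
    min_accuracy: float    # accuracy must reach at least this
    max_cost: float        # cost must not exceed this


@dataclass
class Outcome:
    latency_s: float
    accuracy: float
    cost: float


def meets_slo(outcome: Outcome, slo: SLO) -> bool:
    """A query counts toward goodput only if every SLO dimension is met."""
    return (
        outcome.latency_s <= slo.max_latency_s
        and outcome.accuracy >= slo.min_accuracy
        and outcome.cost <= slo.max_cost
    )


def service_goodput(outcomes: Sequence[Outcome], slos: Sequence[SLO]) -> int:
    """Count the queries whose outcome satisfies their own SLO."""
    return sum(meets_slo(o, s) for o, s in zip(outcomes, slos))


# Made-up measurements for three queries sharing the same targets.
slos = [SLO(max_latency_s=1.0, min_accuracy=0.8, max_cost=5.0)] * 3
outcomes = [
    Outcome(latency_s=0.7, accuracy=0.85, cost=3.0),  # meets all targets
    Outcome(latency_s=1.4, accuracy=0.90, cost=2.0),  # misses the latency target
    Outcome(latency_s=0.9, accuracy=0.80, cost=5.0),  # meets all targets at the boundary
]
goodput = service_goodput(outcomes, slos)
print("goodput:", goodput, "of", len(outcomes))
```

Counting a query only when all three dimensions hold is what makes the planning problem joint: a plan that is fast and cheap but inaccurate contributes nothing.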

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Compass brings a decomposed planner to compound AI serving with promising gains, but the paper does not rigorously show that optimality is preserved after decomposition.


The main thing to know is that Compass decomposes the many-query multi-SLO planning problem for compound AI pipelines, exploits plan similarities to cut search effort, adds selective profiling, and uses runtime bipartite matching, with reported gains of 2.4-5.1x in goodput, 3.8-4.5x lower costs, and 4.2-10.5x faster planning. Those numbers are the headline result, but the claim that the decomposition still delivers near-optimal global decisions rests on an unproven assumption.

Referee Report

1 major / 1 minor

Summary. The paper presents Compass, the first SLO-aware query planner for large-scale compound AI workloads spanning cloud and edge deployments. It decomposes the many-query multi-SLO planning problem into tractable subproblems, exploits plan similarities within and across queries to reduce search steps, introduces a selective plan profiler for efficient performance estimates, and uses query-plan bipartite matching at runtime to maximize SLO goodput under contention. Real-world evaluations report 2.4--5.1x higher service goodput, 3.8--4.5x lower deployment costs, and 4.2--10.5x faster planning, with responsiveness in seconds and near-optimal decision quality.

Significance. If the decomposition and similarity exploitation indeed preserve near-optimal global decisions, the work would be significant for enabling practical, real-time serving of compound AI pipelines with heterogeneous SLOs on shared infrastructure. The reported empirical gains are substantial and directly relevant to deployment challenges in generative AI and edge-assisted applications; the absence of machine-checked proofs or parameter-free derivations is offset by the concrete system-level evaluation focus.

major comments (1)

[Abstract] Abstract (Compass approach paragraph): the central claim that 'decomposes the many-query, multi-SLO planning problem into tractable subproblems while preserving global decision quality' lacks approximation bounds, worst-case analysis, or any small-scale comparison against an exact solver such as ILP or exhaustive enumeration. This is load-bearing for the 2.4--5.1x goodput and near-optimal quality assertions, because similarity-based pruning could discard cross-query allocations that only appear optimal at the global level under contention.
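The kind of small-scale check this comment asks for can be sketched as follows. The example compares a heuristic planner against exhaustive enumeration on a toy instance and reports the optimality gap. Everything here is an assumption made for illustration: the `(resource_units, satisfies_slo)` plan encoding, the shared capacity, and especially `greedy_planner`, which is a stand-in heuristic and not Compass's algorithm.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

# Hypothetical plan encoding: (resource_units, satisfies_slo).
Plan = Tuple[int, bool]


def goodput(choice: Sequence[Optional[Plan]], capacity: int) -> int:
    """Served queries meeting their SLO, or -1 if the choice exceeds capacity."""
    used = sum(p[0] for p in choice if p is not None)
    if used > capacity:
        return -1
    return sum(1 for p in choice if p is not None and p[1])


def exact_optimum(candidates: List[List[Plan]], capacity: int) -> int:
    """Exhaustive enumeration; None means the query is left unserved."""
    options = [[None] + plans for plans in candidates]
    return max(goodput(choice, capacity) for choice in product(*options))


def greedy_planner(candidates: List[List[Plan]], capacity: int) -> int:
    """Stand-in heuristic: each query, in order, takes its first feasible plan."""
    remaining, served = capacity, 0
    for plans in candidates:
        for units, ok in plans:
            if ok and units <= remaining:
                remaining -= units
                served += 1
                break
    return served


# Toy instance: q1 prefers a 4-unit plan but could run on 2 units.
candidates = [[(4, True), (2, True)], [(3, True)], [(3, True)]]
capacity = 8

exact = exact_optimum(candidates, capacity)
heuristic = greedy_planner(candidates, capacity)
gap = 1 - heuristic / exact
print(f"exact={exact} heuristic={heuristic} gap={gap:.2f}")
```

The instance is built so that the per-query best choice (4 units for q1) crowds out q3, while the global optimum downgrades q1 to 2 units and serves all three. That is the failure mode the comment describes: an allocation that only looks optimal once contention is considered globally.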

minor comments (1)

[Abstract] The abstract refers to 'real-world evaluations' and 'near-optimal decision quality' without specifying workload characteristics, baseline systems, or statistical significance tests; these details should be expanded in the evaluation section for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical significance of Compass for compound AI serving. We address the major comment below and will revise the manuscript to strengthen the supporting evidence for our central claims.


Referee: [Abstract] Abstract (Compass approach paragraph): the central claim that 'decomposes the many-query, multi-SLO planning problem into tractable subproblems while preserving global decision quality' lacks approximation bounds, worst-case analysis, or any small-scale comparison against an exact solver such as ILP or exhaustive enumeration. This is load-bearing for the 2.4--5.1x goodput and near-optimal quality assertions, because similarity-based pruning could discard cross-query allocations that only appear optimal at the global level under contention.

Authors: We thank the referee for identifying this gap in the presentation of our claims. Compass decomposes the joint planning problem by first generating candidate plans per query (with intra- and inter-query similarity pruning to reduce the search space) and then resolving resource contention via bipartite matching at runtime. The matching step explicitly accounts for cross-query interactions and contention, which is how global quality is intended to be preserved. We acknowledge that the manuscript currently lacks formal approximation bounds or worst-case analysis, which is a limitation given the NP-hard multi-objective nature of the problem. However, the evaluation sections already compare Compass against strong baselines and report near-optimal decision quality under the tested workloads. To directly address the referee's concern, we will add a new small-scale experiment in the revised manuscript that solves an ILP formulation (using a standard solver) on instances with 5-10 queries where exact solutions remain tractable. This will quantify the optimality gap introduced by decomposition and pruning. We will also update the abstract to reference these results. On the specific worry about pruning discarding globally optimal allocations, the similarity metric only removes plans that are strictly dominated across all SLO dimensions, and the runtime matching re-optimizes assignments; we will clarify this mechanism and its safeguards in the revision.

revision: yes
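The rebuttal's safeguard, removing only plans that are strictly dominated across all SLO dimensions, can be illustrated with a short filter. This is a sketch under the assumption that the dimensions are latency, accuracy, and cost as named in the abstract; the `Plan` fields, function names, and sample values are hypothetical, and the paper's similarity metric itself is not reproduced here.

```python
from typing import List, NamedTuple


class Plan(NamedTuple):
    name: str
    latency_s: float  # lower is better
    accuracy: float   # higher is better
    cost: float       # lower is better


def dominates(a: Plan, b: Plan) -> bool:
    """True if a is strictly better than b on every SLO dimension."""
    return (
        a.latency_s < b.latency_s
        and a.accuracy > b.accuracy
        and a.cost < b.cost
    )


def prune_dominated(plans: List[Plan]) -> List[Plan]:
    """Keep every plan that no other plan strictly dominates."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


# Made-up candidates: B is worse than A everywhere; C trades accuracy for speed.
plans = [
    Plan("A", latency_s=1.0, accuracy=0.90, cost=2.0),
    Plan("B", latency_s=2.0, accuracy=0.80, cost=3.0),
    Plan("C", latency_s=0.5, accuracy=0.70, cost=5.0),
]
kept = prune_dominated(plans)
print([p.name for p in kept])
```

A strictly dominated plan can never be the better choice for a single query's SLOs. Whether removing it is also safe under contention depends on whether resource demand is one of the compared dimensions, which is the point the referee asks the authors to clarify.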

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain


The paper presents Compass as an algorithmic query planner whose core claims rest on a decomposition strategy and plan-similarity exploitation, with performance gains (goodput, cost, planning time) demonstrated exclusively through external real-world evaluations on compound AI workloads. No equations, fitted parameters, or self-citations are shown to reduce the preservation of global decision quality to a definitional tautology or input renaming; the approach is framed as a practical system whose value is measured against independent benchmarks rather than derived by construction from its own assumptions. The derivation chain therefore remains self-contained against external measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on domain assumptions about the structure of compound-AI planning problems and the effectiveness of similarity-based decomposition; no free parameters or invented entities are identifiable from the abstract alone.

axioms (2)

domain assumption: Diverse SLOs, heterogeneous device speeds, and high query volume render existing planning methods intractable.
Stated as the motivation for Compass in the abstract.

domain assumption: Decomposing the planning problem into tractable subproblems while exploiting plan similarities preserves global decision quality.
Core premise of the Compass design described in the abstract.

pith-pipeline@v0.9.0 · 5824 in / 1465 out tokens · 55126 ms · 2026-05-22T19:09:51.759344+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
cs.DC 2026-04 unverdicted novelty 6.0

ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.

Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1]

Gemma 3 Technical Report

Gemma 3. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

The top 100 gen ai consumer apps

A16Z. The top 100 gen ai consumer apps. A16Z, 2025

work page 2025
[3]

Tarzan: Passively-learned real-time rate control for video conferencing

Neil Agarwal, Rui Pan, Francis Y Yan, and Ravi Netravali. Tarzan: Passively-learned real-time rate control for video conferencing. arXiv preprint arXiv:2410.03339, 2024

work page arXiv 2024
[4]

Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In OSDI, 2024

work page 2024
[5]

CherryPick: Adaptively unearthing the best cloud configurations for big data analytics

Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, and Ming Zhang. CherryPick: Adaptively unearthing the best cloud configurations for big data analytics. In NSDI, 2017

work page 2017
[6]

Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Fara- jtabar

Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, S. Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Fara- jtabar. Llm in a flash: Efficient large language model inference with limited memory. In ACL, 2024

work page 2024
[7]

https://www.aboutamazon.com/news/retail/ how-to-use-amazon-rufus , 2024

Amazon’s generative ai-powered shopping assistant. https://www.aboutamazon.com/news/retail/ how-to-use-amazon-rufus , 2024

work page 2024
[8]

https://ir.amd.com/ news-events/press-releases/detail/663/ amd-reveals-worlds-first-hardware-virtualized-gpu-product-line

Amd reveals world’s first hardware-virtualized gpu product line. https://ir.amd.com/ news-events/press-releases/detail/663/ amd-reveals-worlds-first-hardware-virtualized-gpu-product-line

work page
[9]

Chatgpt free vs paid: What’s the difference? Apidog, 2023

Apidog. Chatgpt free vs paid: What’s the difference? Apidog, 2023

work page 2023
[10]

wav2vec 2.0: A framework for self-supervised learning of speech representations

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems , 33:12449–12460, 2020

work page 2020
[11]

vtrain: A simulation framework for evaluating cost-effective and compute- optimal large language model training, 2023

Jehyeon Bang, Yujeong Choi, Myeongwoo Kim, Yongdeok Kim, and Minsoo Rhu. vtrain: A simulation framework for evaluating cost-effective and compute- optimal large language model training, 2023

work page 2023
[12]

On la- tency of e-commerce platforms

Marcus Basalla, Johannes Schneider, Martin Luksik, Roope Jaakonmäki, and Jan V om Brocke. On la- tency of e-commerce platforms. Journal of Organiza- tional Computing and Electronic Commerce, 31(1):1– 17, 2021

work page 2021
[13]

Ekya: Continuous learning of video analytics models on edge compute servers

Romil Bhardwaj, Zhengxu Xia, Ganesh Anantha- narayanan, Junchen Jiang, Yuanchao Shu, Nikolaos Karianakis, Kevin Hsieh, Paramvir Bahl, and Ion Sto- ica. Ekya: Continuous learning of video analytics models on edge compute servers. In 19th USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI 22), pages 119–135, 2022

work page 2022
[14]

Recall: Empowering mul- timodal embedding for edge devices

Dongqi Cai, Shangguang Wang, Chen Peng, Zeling Zhang, and Mengwei Xu. Recall: Empowering mul- timodal embedding for edge devices. In arXiv: 2409.15342, 2024

work page arXiv 2024
[15]

https://openai.com/index/ chatgpt-can-now-see-hear-and-speak/

Chatgpt can now see, hear, and speak. https://openai.com/index/ chatgpt-can-now-see-hear-and-speak/

work page
[16]

Internvl: Scaling up vi- sion foundation models and aligning for generic visual- linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vi- sion foundation models and aligning for generic visual- linguistic tasks. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

work page 2024
[17]

30+ deepseek statistics: How this ai model is changing the game

Cropink. 30+ deepseek statistics: How this ai model is changing the game. Cropink, 2025

work page 2025
[18]

https://paperswithcode.com/ dataset/dancetrack, 2020

Dancetrack. https://paperswithcode.com/ dataset/dancetrack, 2020

work page 2020
[19]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

DeepSeek. Deepseek-coder: When the large language model meets programming – the rise of code intelli- gence. arXiv preprint arXiv:2401.14196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Llm.int8(): 8-bit matrix multiplication for transformers at scale

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale. In NeurIPS, 2022

work page 2022
[21]

Ds-1000: A natural and reliable bench- mark for data science code generation

DS-1000. Ds-1000: A natural and reliable bench- mark for data science code generation. arXiv preprint arXiv:2211.11501, 2022

work page arXiv 2022
[22]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Matthew W. G. Dye, Shawn C. Green, and Daphné Bavelier. Increasing speed of processing with action video games. Current Directions in Psychological Science, 2009

work page 2009
[24]

https: //huggingface.co/datasets/google/fleurs, 2022

Flores machine translation benchmark. https: //huggingface.co/datasets/google/fleurs, 2022

work page 2022
[25]

Gardner, Matt J

Jacob R. Gardner, Matt J. Kusner, Zhixiang (Eddie) Xu, Kilian Q. Weinberger, and John P. Cunningham. Bayesian optimization with inequality constraints. In ICML, 2014

work page 2014
[26]

https://cloud

The evolution of play: From live to living games. https://cloud. google.com/blog/products/gaming/ generative-ai-fuels-next-gen-living-games , 2024

work page 2024
[27]

https://cloud.google

Iot platform product architecture on google cloud. https://cloud.google. com/architecture/connected-devices/ iot-platform-product-architecture

work page
[28]

https://cloud.google

Google cloud: Gpu pricing. https://cloud.google. com/compute/gpus-pricing?hl=en

work page
[29]

https://cloud.google.com/ gemini-api/pricing

Gemini api pricing. https://cloud.google.com/ gemini-api/pricing

work page
[30]

https://deepmind.google/discover/blog/ genie-2-a-large-scale-foundation-world-model/ , 2024

Genie 2: A large-scale foundation world model. https://deepmind.google/discover/blog/ genie-2-a-large-scale-foundation-world-model/ , 2024

work page 2024
[31]

https://spritea.github.io/ GMOT40/, 2021

Gmot-40(jì mò-40). https://spritea.github.io/ GMOT40/, 2021

work page 2021
[32]

W. S. Gosset. The probable error of a mean. Biometrika, 6(1):1–25, 1908

work page 1908
[33]

gRPC: A High Performance, Open Source Universal RPC Framework

gRPC Authors. gRPC: A High Performance, Open Source Universal RPC Framework. https://grpc. io/

work page
[34]

Fila: Online audit- ing of machine learning model accuracy under finite labelling budget

Naiqing Guan and Nick Koudas. Fila: Online audit- ing of machine learning model accuracy under finite labelling budget. In SIGMOD, 2022

work page 2022
[35]

Serving DNNs like clockwork: Performance predictability from the bottom up

Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. Serving DNNs like clockwork: Performance predictability from the bottom up. In OSDI, 2020

work page 2020
[36]

Deep residual learning for image recognition, 2015

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015

work page 2015
[37]

An efficient bandit algorithm for realtime multivariate optimization

Daniel N Hill, Houssam Nassif, Yi Liu, Anand Iyer, and SVN Vishwanathan. An efficient bandit algorithm for realtime multivariate optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1813– 1821, 2017

work page 2017
[38]

Honeywell Forge: Enterprise Performance Management for Industrials

Honeywell. Honeywell Forge: Enterprise Performance Management for Industrials. https://www.honeywell.com/us/en/solutions/ honeywell-forge, 2024

work page 2024
[39]

Gibbons, and Onur Mutlu

Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Phili- pose, Phillip B. Gibbons, and Onur Mutlu. Focus: Querying large video datasets with low latency and low cost. In OSDI, 2018

work page 2018
[40]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Ab- delrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

work page 2021
[41]

Multimodal pretraining for dense video captioning

Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, and Radu Soricut. Multimodal pretraining for dense video captioning. In AACL-IJCNLP 2020, 2020

work page 2020
[42]

Evaluating Large Language Models Trained on Code

HumanEval. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[43]

Chatgpt facts and statistics you need to know in 2025

Invgate. Chatgpt facts and statistics you need to know in 2025. Invgate, 2025

work page 2025
[44]

Chameleon: scalable adaptation of video analytics

Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik, Siddhartha Sen, and Ion Stoica. Chameleon: scalable adaptation of video analytics. In Proceedings of the 2018 conference of the ACM special interest group on data communication, pages 253–266, 2018

work page 2018
[45]

https://kubernetes.io/

Kubernetes: Production-grade container scheduling and management. https://kubernetes.io/

work page
[46]

Advances and open problems in federated learning,

Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cor- mode, Rachel Cummings, Rafael G. L. DOliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Har- chaoui, ...

work page arXiv 1912
[47]

{RECL}: Re- sponsive {Resource-Efficient} continuous learning for video analytics

Mehrdad Khani, Ganesh Ananthanarayanan, Kevin Hsieh, Junchen Jiang, Ravi Netravali, Yuanchao Shu, Mohammad Alizadeh, and Victor Bahl. {RECL}: Re- sponsive {Resource-Efficient} continuous learning for video analytics. In 20th USENIX Symposium on Net- worked Systems Design and Implementation (NSDI 23), pages 917–932, 2023

work page 2023
[48]

Selecta: heterogeneous cloud storage configuration for data analytics

Ana Klimovic, Heiner Litz, and Christos Kozyrakis. Selecta: heterogeneous cloud storage configuration for data analytics. In ATC, 2018

work page 2018
[49]

Cascadeserve: Unlock- ing model cascades for inference serving, 2024

Ferdi Kossmann, Ziniu Wu, Alex Turk, Nesime Tatbul, Lei Cao, and Samuel Madden. Cascadeserve: Unlock- ing model cascades for inference serving, 2024

work page 2024
[50]

Dense-captioning events in videos

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In International Conference on Computer Vision (ICCV), 2017

work page 2017
[51]

Gon- zalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gon- zalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In SOSP, 2023

work page 2023
[52]

Fan Lai, Yinwei Dai, Sanjay S. Singapuram, Jiachen Liu, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury. FedScale: Benchmarking model and system performance of federated learning at scale. In International Conference on Machine Learning (ICML), 2022
[53]

Fan Lai, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury. Oort: Efficient federated learning via guided participant selection. In OSDI, 2021
[54]

Langchain. https://www.langchain.com/
[55]

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In OSDI, 2023
[56]

Librispeech. https://huggingface.co/datasets/openslr/librispeech_asr, 2021
[57]

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. In MLSys, 2024
[58]

Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, and Mosharaf Chowdhury. Andes: Defining and enhancing quality-of-experience in llm-based text streaming services. arXiv preprint arXiv:2404.16283, 2024
[59]

Ted-lium corpus. https://huggingface.co/datasets/LIUM/tedlium, 2022
[60]

Chengfei Lv, Chaoyue Niu, Renjie Gu, Xiaotang Jiang, Zhaode Wang, Bin Liu, Ziqi Wu, Qiulin Yao, Congyu Huang, Panos Huang, Tao Huang, Hui Shu, Jinde Song, Bin Zou, Peng Lan, Guohuan Xu, Fei Wu, Shaojie Tang, Fan Wu, and Guihai Chen. Walle: An End-to-End, General-Purpose, and Large-Scale production system for Device-Cloud collaborative machine learning. ... 2022
[61]

MBPP. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021
[62]

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. Helix: Distributed serving of large language models via max-flow on heterogeneous gpus. In ASPLOS, 2025
[63]

Minds-14. https://huggingface.co/datasets/PolyAI/minds14, 2022
[64]

Mobiperf: Measuring network performance on mobile platforms. https://www.measurementlab.net/tests/mobiperf/
[65]

Mot17 challenge. https://motchallenge.net/data/MOT17/, 2017
[66]

Mot20 challenge. https://motchallenge.net/data/MOT20/, 2020
[67]

Microsoft rocket for live video analytics. https://www.microsoft.com/en-us/research/project/live-video-analytics/, 2020
[68]

Next-qa: Next phase of question-answering to explaining temporal actions. https://doc-doc.github.io/docs/nextqa.html, 2021
[69]

Nvidia multi-process service. https://docs.nvidia.com/deploy/mps/index.html#/
[70]

Openai: Introducing next-generation audio models in the api. https://openai.com/index/introducing-our-next-generation-audio-models/
[71]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023
[72]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020
[73]

Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. Astra-sim: Enabling sw/hw co-design exploration for distributed dl training platforms. In ISPASS, 2020
[74]

J Redmon. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016
[75]

Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis. INFaaS: Automated model-less inference serving. In ATC, 2021
[76]

Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, and Ion Stoica. Fairness in serving large language models. In OSDI, 2024
[77]

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. Advances in neural information processing systems, 25, 2012
[78]

Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. Scalable bayesian optimization using deep neural networks. In ICML, 2015
[79]

Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu. In SOSP, 2024
[80]

Sportsmot: A large multi-object tracking dataset in multiple sports scenes. https://paperswithcode.com/dataset/sportsmot, 2020
