CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters

Na Yan; Shaoyuan Huang; Tiancheng Zhang; Wenyu Wang; Xiaofei Wang; Xiaokai Wang; Yansha Deng; Yunfeng Zhao

arXiv: 2604.16400 · v2 · pith:37OXJOHNnew · submitted 2026-03-31 · 💻 cs.DC · cs.AI · cs.LG

Shaoyuan Huang, Yunfeng Zhao, Na Yan, Tiancheng Zhang, Xiaokai Wang, Xiaofei Wang, Wenyu Wang, Yansha Deng

Pith reviewed 2026-05-21 10:29 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.LG

keywords LLM serving · fine-tuning · edge intelligence · GPU clusters · parameter sharing · SLO awareness · continuous adaptation · federated learning

The pith

CoLLM unifies fine-tuning and inference on shared edge GPU replicas so that model updates improve serving quality without extra deployments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CoLLM as a system that places federated parameter-efficient fine-tuning and inference on the same replicas instead of running them separately. An intra-replica sharing method keeps inference paths unmerged and uses shadow adapters so that fresh parameters reach inference immediately. A two-timescale coordination routine then moves work across replicas to protect short-term response targets while still harvesting long-term quality gains from fine-tuning. If the approach holds, edge clusters no longer need duplicate hardware pools or wait for separate retraining cycles before new adaptations appear in production.
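A minimal sketch of how unmerged inference and a shadow adapter could fit together inside one replica, written in Python with NumPy. This is an illustration, not the paper's implementation: the class `SharedLoRALinear`, the `down`/`up` factor names, the `publish` step, and the plain gradient update are all assumptions introduced here. The frozen base weight is never modified; fine-tuning writes to a shadow copy of the low-rank factors, and inference reads an active copy that is replaced when an update is published.

```python
import threading

import numpy as np


class SharedLoRALinear:
    """Hypothetical linear layer whose frozen base weight serves both workloads."""

    def __init__(self, d_in, d_out, rank, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen base weight, shared by fine-tuning and inference.
        self.W = rng.standard_normal((d_in, d_out)).astype(np.float32)
        self.scale = scale
        # Active adapter, read by inference. `up` starts at zero so the
        # adapter initially contributes nothing, as in LoRA initialisation.
        down = rng.standard_normal((d_in, rank)).astype(np.float32) * 0.01
        up = np.zeros((rank, d_out), dtype=np.float32)
        self.active = (down, up)
        # Shadow adapter, written by the fine-tuning loop.
        self.shadow = (down.copy(), up.copy())
        self._lock = threading.Lock()

    def infer(self, x):
        # Unmerged forward pass: the adapter term is added on the fly,
        # so W is never overwritten with W + down @ up.
        with self._lock:
            down, up = self.active
        return x @ self.W + self.scale * (x @ down) @ up

    def train_step(self, grad_down, grad_up, lr=0.1):
        # Assumed plain gradient step; only the shadow copy changes.
        down, up = self.shadow
        self.shadow = (down - lr * grad_down, up - lr * grad_up)

    def publish(self):
        # Copy the shadow factors and swap them in for inference.
        snapshot = tuple(m.copy() for m in self.shadow)
        with self._lock:
            self.active = snapshot
```

In this sketch a published update replaces only two small matrices, which is the property that would let new parameters reach inference without a separate deployment step. Whether CoLLM synchronises the copies this way is not stated in the abstract.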

Core claim

CoLLM shows that a unified co-execution framework on shared replicas, built from unmerged inference plus shadow adapters inside each replica and a two-timescale balancing algorithm between replicas, can jointly raise long-term model quality and short-term inference efficiency, delivering up to three times the goodput of systems that isolate the two workloads.

What carries the argument

Intra-replica model sharing via unmerged inference and shadow adapter strategies, paired with a two-timescale inter-replica coordination algorithm that balances workloads.
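The abstract does not spell out the coordination algorithm, so the following is only a toy illustration of the two-timescale idea in Python. Every name (`Replica`, `tune_share`, `route`, `rebalance`) and the latency model are hypothetical. A fast loop routes each request to the replica with the lowest estimated latency; a slow loop lowers a replica's fine-tuning share when it would miss the latency target and raises it otherwise.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    tune_share: float = 0.5  # fraction of capacity given to fine-tuning (slow knob)
    queued: int = 0          # inference requests currently waiting (fast state)

    def est_latency(self, base_ms: float = 50.0) -> float:
        # Assumed cost model: latency grows with queue depth and shrinks
        # as more of the replica's capacity is left for inference.
        serve_share = max(1.0 - self.tune_share, 0.05)
        return base_ms * (self.queued + 1) / serve_share


def route(replicas):
    """Fast timescale: send one request to the lowest-latency replica."""
    target = min(replicas, key=lambda r: r.est_latency())
    target.queued += 1
    return target


def rebalance(replicas, slo_ms, step=0.1):
    """Slow timescale: trade fine-tuning share against the latency target."""
    for r in replicas:
        if r.est_latency() > slo_ms:
            # Would miss the SLO: give capacity back to inference.
            r.tune_share = max(0.0, r.tune_share - step)
        else:
            # Headroom available: spend it on fine-tuning.
            r.tune_share = min(0.9, r.tune_share + step)
```

The split mirrors the stated goal: per-request decisions protect short-term latency, while the slower adjustment decides how much capacity goes to long-term quality gains.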

If this is right

Fewer total model copies are needed because one replica set supports both ongoing adaptation and live serving.
Inference quality improves as soon as fine-tuning updates arrive rather than after a separate deployment step.
Service-level objectives for latency and throughput can be maintained while model quality still advances over time.
Edge deployments of domain-specific LLMs become faster because training and serving share the same resource footprint.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same sharing pattern could collapse separate training and prediction pools in other paired workloads such as online learning for recommendation systems.
Operators might cut overall GPU hours by merging what are now treated as two independent resource classes.
Extensions could allow multiple simultaneous adaptation streams to run on one replica while preserving the same balancing logic.

Load-bearing premise

Unmerged inference and shadow adapters can reuse parameters in real time without unacceptable overhead or correctness issues on the shared replicas.

What would settle it

A side-by-side measurement of end-to-end latency and accuracy when fine-tuning runs on the same replica as inference versus when the two tasks use entirely separate replicas.

Figures

Figures reproduced from arXiv: 2604.16400 by Na Yan, Shaoyuan Huang, Tiancheng Zhang, Wenyu Wang, Xiaofei Wang, Xiaokai Wang, Yansha Deng, Yunfeng Zhao.

**Figure 1.** GPU multiplexing versus model sharing. (Accompanying text: CoLLM improves goodput, i.e., inference quality-aware throughput, by up to 3× over state-of-the-art LLM serving systems.)

**Figure 2.** Inference CE loss of different execution paradigms.

**Figure 4.** Merged/unmerged inference vs. shadow adapter-based

**Figure 5.** Inference throughput and goodput of different systems across two LLMs and two domain-specific tasks.

**Figure 6.** System performance under different load scales.

Original abstract

As Large Language Models (LLMs) are increasingly adopted in edge intelligence to power domain-specific applications and personalized services, the quality and efficiency of the LLM post-training phase, including fine-tuning and inference, have become critical due to constrained resources. Although recent advances in federated parameter-efficient fine-tuning (FL PEFT) and low-latency inference have improved individual task performance, fine-tuning and inference are still handled as isolated workloads, which overlooks their interdependence and results in redundant deployments and delayed improvement in inference quality. To address these limitations, we introduce a new co-execution framework and instantiate it with CoLLM, a system that unifies FL PEFT and inference on shared edge replicas and model parameters. CoLLM addresses key challenges at both replica and cluster levels through: (1) an intra-replica model sharing mechanism that enables real-time model parameter reuse via unmerged inference and shadow adapter strategies; and (2) a two-timescale inter-replica coordination algorithm that adaptively balances fine-tuning and inference workloads to jointly optimize long-term model quality gains and short-term inference efficiency. Extensive evaluation across diverse LLMs and real-world traces shows that CoLLM consistently outperforms state-of-the-art LLM systems, achieving up to 3x higher goodput and demonstrating its effectiveness in enabling seamless LLM post-training for edge intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CoLLM unifies FL PEFT and inference on shared edge replicas via unmerged inference and shadow adapters plus two-timescale coordination, but the 3x goodput claim needs tighter validation on output correctness.

The main point is that this paper builds CoLLM to run fine-tuning and serving together on the same GPU replicas instead of treating them as separate jobs. It uses unmerged inference with shadow adapters for real-time parameter reuse inside a replica and a two-timescale algorithm to balance long-term quality gains against short-term latency across replicas. That combination is the concrete new piece, even if the individual techniques draw from prior work on PEFT and SLO-aware scheduling. The framing around edge intelligence constraints and redundant deployments is clear and practical. The evaluation summary across multiple LLMs and traces supports the idea that joint optimization can improve goodput, which is useful for anyone dealing with continuous post-training on constrained hardware.

The soft spot is exactly the one the stress-test flags. The abstract and high-level description do not show direct checks that unmerged forward passes produce the same outputs or downstream accuracy as standard merged-adapter serving after the same adapter updates. If attention or activation patterns shift under concurrent changes, goodput numbers could look better while actual model quality slips. That equivalence test is load-bearing for the central claim and should be explicit.

This paper is for systems people working on LLM deployment in federated or edge settings who need ideas for sharing and coordination. It is coherent on its own terms and shows honest engagement with the interdependence problem, so it deserves a serious referee even if the validation details need work in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CoLLM, a system for continuous adaptation of LLMs on shared GPU clusters. It unifies federated parameter-efficient fine-tuning (FL PEFT) and inference workloads on shared replicas using an intra-replica model sharing mechanism based on unmerged inference and shadow adapters, combined with a two-timescale inter-replica coordination algorithm. The central claim is that this approach enables seamless post-training while achieving up to 3x higher goodput than state-of-the-art LLM serving systems, as demonstrated through evaluations on diverse LLMs and real-world traces.

Significance. If the performance claims hold and the sharing mechanism preserves inference quality, this work could have significant impact on edge intelligence by allowing joint optimization of model quality and serving efficiency without redundant resource allocation. The explicit handling of the interdependence between fine-tuning and inference is a strength, and the use of real-world traces adds to the practical relevance. However, the lack of detailed validation for the core mechanism limits the current assessment of its broader implications.

major comments (2)

[Section 4.1] The intra-replica model sharing mechanism via unmerged inference and shadow adapters is described, but there is no explicit validation or equivalence check (e.g., output distribution similarity or downstream task accuracy) comparing it to standard merged-adapter serving after PEFT updates. This is load-bearing for the goodput claim, as any degradation in model quality could offset the reported efficiency gains.
[Section 5] The evaluation section reports up to 3x goodput improvements but provides insufficient details on the baselines compared against, statistical error bars, specific characteristics of the workload traces, and criteria for selecting the diverse LLMs. This undermines the verifiability of the central performance claim.

minor comments (2)

[Abstract] The abstract mentions 'goodput' without a brief definition; consider adding one for clarity to readers unfamiliar with the term in this context.
[Throughout] Define acronyms such as FL PEFT at first use in the main text and use them consistently thereafter.
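Goodput appears in the paper's text as inference quality-aware throughput. One way to make that concrete, assuming each completed request carries a latency and a quality score in [0, 1], is to count only requests that meet the latency target and weight them by quality. The `Completed` record, the weighting, and the SLO threshold below are assumptions for illustration; the paper's exact definition may differ.

```python
from dataclasses import dataclass


@dataclass
class Completed:
    latency_ms: float  # end-to-end latency of one finished request
    quality: float     # assumed normalised quality score in [0, 1]


def throughput(requests, window_s):
    """Plain throughput: finished requests per second, regardless of SLO."""
    return len(requests) / window_s


def goodput(requests, window_s, slo_ms):
    """Quality-weighted requests per second, counting only those within the SLO."""
    useful = sum(r.quality for r in requests if r.latency_ms <= slo_ms)
    return useful / window_s
```

Under this reading, a system can raise throughput without raising goodput if the extra requests miss the latency target or are served by a stale, lower-quality adapter.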

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with point-by-point responses and commit to revisions that strengthen the presentation without altering our core claims.

Referee: [Section 4.1] The intra-replica model sharing mechanism via unmerged inference and shadow adapters is described, but there is no explicit validation or equivalence check (e.g., output distribution similarity or downstream task accuracy) comparing it to standard merged-adapter serving after PEFT updates. This is load-bearing for the goodput claim, as any degradation in model quality could offset the reported efficiency gains.

Authors: We acknowledge the value of explicit empirical validation for this load-bearing mechanism. By design, unmerged inference with shadow adapters computes identical outputs to merged-adapter inference because the adapter weights are applied identically in the forward pass without requiring an explicit merge step; this equivalence follows directly from the linear nature of the adapter updates. To address the referee's concern directly, we will add a dedicated subsection in Section 4.1 (or an appendix) reporting output distribution similarity metrics such as KL divergence and cosine similarity on token logits, along with downstream task accuracy comparisons on standard benchmarks for the LLMs evaluated in the paper.
revision: yes

Referee: [Section 5] The evaluation section reports up to 3x goodput improvements but provides insufficient details on the baselines compared against, statistical error bars, specific characteristics of the workload traces, and criteria for selecting the diverse LLMs. This undermines the verifiability of the central performance claim.

Authors: We agree that expanded details will improve verifiability and reproducibility. In the revised manuscript we will augment Section 5 with: explicit identification and configuration details for all baselines (including vLLM, TensorRT-LLM, and other SLO-aware systems); statistical error bars and standard deviations computed over at least five independent runs; precise characteristics of the workload traces (source, request-rate distributions, SLO definitions, and durations); and the selection criteria for the LLMs (parameter scale, architecture family, and public availability). These additions will be presented in both the main text and a new reproducibility table.
revision: yes
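The equivalence check promised in the first response can be sketched in Python with NumPy. The example uses a single linear layer with a low-rank update as a stand-in for a full model and random weights rather than real checkpoints; the helper names `mean_kl` and `mean_cosine` are introduced here. It compares merged outputs `x @ (W + down @ up)` with unmerged outputs `x @ W + (x @ down) @ up`, treating both as logits.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mean_kl(logits_p, logits_q):
    """Mean KL(p || q) between the softmax distributions of two logit arrays."""
    logp, logq = log_softmax(logits_p), log_softmax(logits_q)
    return float((np.exp(logp) * (logp - logq)).sum(axis=-1).mean())


def mean_cosine(a, b):
    """Mean row-wise cosine similarity between two logit arrays."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float((num / den).mean())


# Random stand-ins for activations, a frozen weight, and adapter factors.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
W = rng.standard_normal((16, 32))
down = rng.standard_normal((16, 4)) * 0.1
up = rng.standard_normal((4, 32)) * 0.1

merged = x @ (W + down @ up)           # adapter folded into the weight
unmerged = x @ W + (x @ down) @ up     # adapter applied as a separate path

print("mean KL:", mean_kl(merged, unmerged))
print("mean cosine:", mean_cosine(merged, unmerged))
```

In exact arithmetic the two paths agree; in floating point they differ by rounding, so a real validation would report these metrics on actual model logits at the serving precision, and while adapters are being updated concurrently.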

Circularity Check

0 steps flagged

No circularity: claims rest on empirical evaluation, not self-referential definitions or fitted predictions

The paper presents CoLLM as a co-execution framework with two concrete mechanisms (intra-replica unmerged inference plus shadow adapters, and a two-timescale coordination algorithm) whose effectiveness is asserted via extensive evaluation on diverse LLMs and real-world traces, reporting up to 3x goodput gains. No equations, fitted parameters, or first-principles derivations appear in the provided text that reduce the reported outcomes to the inputs by construction. The central performance claim is therefore falsifiable against external baselines and does not rely on self-citation chains or renaming of known results. This is the normal, non-circular outcome for a systems paper whose load-bearing evidence is experimental rather than deductive.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the feasibility of unified execution without major interference and on the representativeness of the evaluation traces; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Fine-tuning and inference can be co-executed on shared replicas with real-time parameter reuse via unmerged inference and shadow adapters without unacceptable overhead.
This premise underpins the intra-replica mechanism and is required for the claimed efficiency gains.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "intra-replica model sharing mechanism that enables real-time model parameter reuse via unmerged inference and shadow adapter strategies"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Paper passage: "two-timescale inter-replica coordination algorithm that adaptively balances fine-tuning and inference workloads"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

(2023) Github copilot

GitHub Copilot. (2023) Github copilot. [Online]. Available: https: //github.com/features/copilot

work page 2023

[2] [2]

Y . Mehdi. (2023) Reinventing search with a new ai-powered microsoft bing and edge, your copilot for the web. [Online]. Available: https://blogs.microsoft.com/blog/2023/02/07/

work page 2023

[3] [3]

(2022) Chatgpt: Optimizing language models for dialogue

OpenAI. (2022) Chatgpt: Optimizing language models for dialogue. [Online]. Available: https://openai.com/blog/chatgpt/

work page 2022

[4] [4]

A review on edge large language models: Design, execution, and applications,

Y . Zheng, Y . Chen, B. Qian, X. Shi, Y . Shu, and J. Chen, “A review on edge large language models: Design, execution, and applications,” ACM Comput. Surv., vol. 57, no. 8, Mar. 2025. [Online]. Available: https://doi.org/10.1145/3719664

work page doi:10.1145/3719664 2025

[5] [5]

Mobile edge intelligence for large language models: A contemporary survey,

G. Qu, Q. Chen, W. Wei, Z. Lin, X. Chen, and K. Huang, “Mobile edge intelligence for large language models: A contemporary survey,” IEEE Communications Surveys & Tutorials, 2025

work page 2025

[6] [6]

Exploring parameter-efficient fine-tuning to enable foundation models in feder- ated learning,

G. Sun, U. Khalid, M. Mendieta, P. Wang, and C. Chen, “Exploring parameter-efficient fine-tuning to enable foundation models in feder- ated learning,” in 2024 IEEE International Conference on Big Data (BigData), 2024, pp. 8015–8024

work page 2024

[7] [7]

Heterogeneous LoRA for federated fine-tuning of on-device foundation models,

Y . J. Cho, L. Liu, Z. Xu, A. Fahrezi, and G. Joshi, “Heterogeneous LoRA for federated fine-tuning of on-device foundation models,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 12 903–12 913. [Online]. Available: https://aclantholog...

work page 2024

[8] [8]

Federated fine-tuning for pre-trained foundation models over wireless networks,

Z. Wang, Y . Zhou, Y . Shi, and K. B. Letaief, “Federated fine-tuning for pre-trained foundation models over wireless networks,” Trans. Wireless. Comm., vol. 24, no. 4, p. 3450–3464, Jan. 2025. [Online]. Available: https://doi.org/10.1109/TWC.2025.3531128

work page doi:10.1109/twc.2025.3531128 2025

[9] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9

[10] A. Lee, B. Miranda, and S. Koyejo, “Beyond scale: the diversity coefficient as a data quality metric demonstrates llms are pre-trained on formally diverse data,” Proc. Int. Conf. Mach. Learn. (ICML), 2023

[11] Z. Chen, X. Zhao, C. Zhi, and J. Yin, “DeepBoot: Dynamic Scheduling System for Training and Inference Deep Learning Tasks in GPU Cluster,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 9, pp. 2553–2567, 2023

[12] W. Chen, C. Lu, H. Xu, K. Ye, and C. Xu, “Multiplexing dynamic deep learning workloads with slo-awareness in gpu clusters,” in Proceedings of the Twentieth European Conference on Computer Systems, ser. EuroSys ’25. New York, NY, USA: Association for Computing Machinery, 2025. [Online]. Available: https://doi.org/10.1145/3689031.3696074

[13] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020

[14] J. Li, H. Xu, Y. Zhu, Z. Liu, C. Guo, and C. Wang, “Lyra: Elastic scheduling for deep learning clusters,” in Proceedings of the Eighteenth European Conference on Computer Systems, ser. EuroSys ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 835–850. [Online]. Available: https://doi.org/10.1145/3552326.3587445

[15] S. Choi, S. Lee, Y. Kim, J. Park, Y. Kwon, and J. Huh, “Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing,” in 2022 USENIX Annual Technical Conference (USENIX ATC 22). Carlsbad, CA: USENIX Association, Jul. 2022, pp. 199–216

[16] NVIDIA. (2023) Multi-process service (mps). [Online]. Available: https://docs.nvidia.com/deploy/mps/index.html

[17] H. Zhang, Y. Tang, A. Khandelwal, and I. Stoica, “Shepherd: Serving DNNs in the Wild,” NSDI, 2023

[18] P. Han, S. Wang, Y. Jiao, and J. Huang, “Federated Learning while Providing Model as a Service: Joint Training and Inference Optimization,” Proceedings - IEEE INFOCOM, pp. 631–640, 2024

[19] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017

[20] D. Wu, R. Ullah, P. Harvey, P. Kilpatrick, I. Spence, and B. Varghese, “Fedadapt: Adaptive offloading for iot devices in federated learning,” IEEE Internet of Things Journal, vol. 9, no. 21, 2022

[21] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” Advances in neural information processing systems, vol. 36, pp. 10 088–10 115, 2023

[22] N. Hyeon-Woo, M. Ye-Bin, and T.-H. Oh, “Fedpara: Low-rank hadamard product for communication-efficient federated learning,” arXiv preprint arXiv:2108.06098, 2021

[23] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. Raffel, “Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning,” 2022. [Online]. Available: https://arxiv.org/abs/2205.05638

[24] S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le, “Don’t decay the learning rate, increase the batch size,” ICLR, 2018

[25] K. Luo, K. Zhao, T. Ouyang, X. Zhang, Z. Zhou, H. Wang, and X. Chen, “Efficient Coordination of Federated Learning and Inference Offloading at the Edge: A Proactive Optimization Paradigm,” IEEE Transactions on Mobile Computing, vol. PP, pp. 1–15, 2024

[26] E. Mosqueira-Rey, E. Hernández-Pereira, D. Alonso-Ríos, J. Bobes-Bascarán, and A. Fernández-Leal, “Human-in-the-loop machine learning: a state of the art,” Artif. Intell. Rev., vol. 56, no. 4, pp. 3005–3054, Aug. 2022. [Online]. Available: https://doi.org/10.1007/s10462-022-10246-w

[27] N. Lambert, L. Castricato, L. von Werra, and A. Havrilla, “Illustrating reinforcement learning from human feedback (rlhf),” Hugging Face Blog, 2022, https://huggingface.co/blog/rlhf

[28] thanhkt, “manim_code,” https://huggingface.co/datasets/thanhkt/manim_code

[29] sahil2801, “Codealpaca-20k,” https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k

[30] iamtarun, “code_instructions_120k_alpaca,” https://huggingface.co/datasets/iamtarun/code_instructions_120k_alpaca

[31] tatsu-lab, “alpaca,” https://huggingface.co/datasets/tatsu-lab/alpaca

[32] teknium, “Gpteacher-general-instruct,” https://huggingface.co/datasets/teknium/GPTeacher-General-Instruct

[33] hakurei, “open-instruct-v1,” https://huggingface.co/datasets/hakurei/open-instruct-v1

[34] J. Stojkovic, C. Zhang, I. Goiri, J. Torrellas, and E. Choukse, “DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). Los Alamitos, CA, USA: IEEE Computer Society, Mar. 2025, pp. 1348–1362. [Online]. Available: https://doi.ieeecomputerso...

[35] B. Wu, R. Zhu, Z. Zhang, P. Sun, X. Liu, and X. Jin, “dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving,” OSDI, pp. 911–927, 2024

[36] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, and B. Bossan, “Peft: State-of-the-art parameter-efficient fine-tuning methods,” https://github.com/huggingface/peft, 2022

[37] L. Gao, J. Liu, H. Xu, S. Xu, Q. Ma, and L. Huang, “Accelerating End-Cloud Collaborative Inference via Near Bubble-free Pipeline Optimization,” INFOCOM 2025, 2025

[38] J. Liu, Y. Liao, H. Xu, Y. Xu, J. Liu, and C. Qian, “Adaptive parameter-efficient federated fine-tuning on heterogeneous devices,” IEEE Transactions on Mobile Computing, vol. 24, no. 11, pp. 12 533–12 549, 2025

[39] Y. Su, N. Yan, Y. Deng, M. Dohler, and R. Schober, “Haflq: Heterogeneous adaptive federated lora fine-tuned llm with quantization.” [Online]. Available: https://arxiv.org/abs/2411.06581

[41] M. Xu, D. Cai, Y. Wu, X. Li, and S. Wang, “FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences,” Proceedings of the 2024 USENIX Annual Technical Conference, ATC 2024, pp. 579–596, 2024

[42] L. Tan, P. Zhou, S. Guo, J. Zhao, Z. Kuang, D. Qiao, and L. Yang, “Partitioned collaborative inference for on-device models via evolutionary reinforcement learning,” in 2025 IEEE 45th International Conference on Distributed Computing Systems (ICDCS), 2025, pp. 714–724

[43] Z. Shen, Y. He, Z. Wang, Y. Zhang, G. Sun, W. Ye, and A. Li, “EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices,” MobiSys, 2025, vol. 1, no. 1

[44] H. Cai, Z. Zhou, and Q. Huang, “Online Resource Allocation for Edge Intelligence with Colocated Model Retraining and Inference,” Proceedings - IEEE INFOCOM, pp. 1900–1909, 2024

[45] Y. He, H. Yang, Y. Lu, A. Klimovic, and G. Alonso, “LLMStation: Resource Multiplexing in Tuning and Serving Large Language Models,” Proceedings of the 2025 USENIX Annual Technical Conference, ATC 2025, pp. 1639–1655, 2025

[46] G. Oliaro, X. Miao, X. Cheng, V. Kada, M. Wu, R. Gao, Y. Huang, R. Delacourt, A. Yang, Y. Wang, C. Unger, and Z. Jia, “Flexllm: Token-level co-serving of llm inference and finetuning with slo guarantees,” in The 23rd USENIX Symposium on Networked Systems Design and Implementation, 2026. [Online]. Available: https://arxiv.org/abs/2402.18789