CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters
Pith reviewed 2026-05-21 10:29 UTC · model grok-4.3
The pith
CoLLM unifies fine-tuning and inference on shared edge GPU replicas so that model updates improve serving quality without extra deployments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoLLM shows that a unified co-execution framework on shared replicas, built from unmerged inference plus shadow adapters inside each replica and a two-timescale balancing algorithm between replicas, can jointly raise long-term model quality and short-term inference efficiency, delivering up to three times the goodput of systems that isolate the two workloads.
What carries the argument
Intra-replica model sharing via unmerged inference and shadow adapter strategies, paired with a two-timescale inter-replica coordination algorithm that balances workloads.
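The unmerged-inference idea can be illustrated with a small numerical sketch. It assumes LoRA-style low-rank adapters, which this page does not spell out, and every name in it (`merged_forward`, `unmerged_forward`, `scale`, the matrix shapes) is hypothetical rather than taken from CoLLM. It only shows that keeping the adapter as a separate low-rank path gives the same output as folding it into the base weight, up to floating-point rounding; it does not model the shadow adapter strategy.

```python
import numpy as np

def merged_forward(x, W, A, B, scale):
    """Fold the adapter into the base weight, then apply it (merged serving)."""
    W_merged = W + scale * (B @ A)
    return x @ W_merged.T

def unmerged_forward(x, W, A, B, scale):
    """Keep W frozen and add the low-rank adapter path as a separate term."""
    return x @ W.T + scale * ((x @ A.T) @ B.T)

# Toy shapes; assumptions for illustration only.
rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 48, 8
W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((rank, d_in))    # adapter down-projection
B = rng.standard_normal((d_out, rank))   # adapter up-projection
x = rng.standard_normal((4, d_in))       # batch of 4 input vectors
scale = 0.5

y_merged = merged_forward(x, W, A, B, scale)
y_unmerged = unmerged_forward(x, W, A, B, scale)
max_abs_diff = float(np.max(np.abs(y_merged - y_unmerged)))
```

Because the base weight `W` is never modified in the unmerged path, a fine-tuning job could in principle update `A` and `B` while the same `W` keeps serving requests, which is the kind of parameter reuse the summary describes.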
If this is right
- Fewer total model copies are needed because one replica set supports both ongoing adaptation and live serving.
- Inference quality improves as soon as fine-tuning updates arrive rather than after a separate deployment step.
- Service-level objectives for latency and throughput can be maintained while model quality still advances over time.
- Edge deployments of domain-specific LLMs become faster because training and serving share the same resource footprint.
Where Pith is reading between the lines
- The same sharing pattern could collapse separate training and prediction pools in other paired workloads such as online learning for recommendation systems.
- Operators might cut overall GPU hours by merging what are now treated as two independent resource classes.
- Extensions could allow multiple simultaneous adaptation streams to run on one replica while preserving the same balancing logic.
Load-bearing premise
Unmerged inference and shadow adapters can reuse parameters in real time without unacceptable overhead or correctness issues on the shared replicas.
What would settle it
A side-by-side measurement of end-to-end latency and accuracy when fine-tuning runs on the same replica as inference versus when the two tasks use entirely separate replicas.
Original abstract
As Large Language Models (LLMs) are increasingly adopted in edge intelligence to power domain-specific applications and personalized services, the quality and efficiency of the LLM post-training phase, including fine-tuning and inference, have become critical due to constrained resources. Although recent advances in federated parameter-efficient fine-tuning (FL PEFT) and low-latency inference have improved individual task performance, fine-tuning and inference are still handled as isolated workloads, which overlooks their interdependence and results in redundant deployments and delayed improvement in inference quality. To address these limitations, we introduce a new co-execution framework and instantiate it with CoLLM, a system that unifies FL PEFT and inference on shared edge replicas and model parameters. CoLLM addresses key challenges at both replica and cluster levels through: (1) an intra-replica model sharing mechanism that enables real-time model parameter reuse via unmerged inference and shadow adapter strategies; and (2) a two-timescale inter-replica coordination algorithm that adaptively balances fine-tuning and inference workloads to jointly optimize long-term model quality gains and short-term inference efficiency. Extensive evaluations across diverse LLMs and real-world traces show that CoLLM consistently outperforms state-of-the-art LLM systems, achieving up to 3x higher goodput, demonstrating its effectiveness in enabling seamless LLM post-training for edge intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CoLLM, a system for continuous adaptation of LLMs on shared GPU clusters. It unifies federated parameter-efficient fine-tuning (FL PEFT) and inference workloads on shared replicas using an intra-replica model sharing mechanism based on unmerged inference and shadow adapters, combined with a two-timescale inter-replica coordination algorithm. The central claim is that this approach enables seamless post-training while achieving up to 3x higher goodput than state-of-the-art LLM serving systems, as demonstrated through evaluations on diverse LLMs and real-world traces.
Significance. If the performance claims hold and the sharing mechanism preserves inference quality, this work could have significant impact on edge intelligence by allowing joint optimization of model quality and serving efficiency without redundant resource allocation. The explicit handling of the interdependence between fine-tuning and inference is a strength, and the use of real-world traces adds to the practical relevance. However, the lack of detailed validation for the core mechanism limits the current assessment of its broader implications.
major comments (2)
- [Section 4.1] The intra-replica model sharing mechanism via unmerged inference and shadow adapters is described, but there is no explicit validation or equivalence check (e.g., output distribution similarity or downstream task accuracy) comparing it to standard merged-adapter serving after PEFT updates. This is load-bearing for the goodput claim, as any degradation in model quality could offset the reported efficiency gains.
- [Section 5] The evaluation section reports up to 3x goodput improvements but provides insufficient details on the baselines compared against, statistical error bars, specific characteristics of the workload traces, and criteria for selecting the diverse LLMs. This undermines the verifiability of the central performance claim.
minor comments (2)
- [Abstract] The abstract mentions 'goodput' without a brief definition; consider adding one for clarity to readers unfamiliar with the term in this context.
- [Throughout] Ensure consistent use of acronyms like FL PEFT upon first use in the main text.
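Since the first minor comment notes that goodput is left undefined, a minimal sketch of one common definition may help: the rate of requests that finish within their latency SLOs. The paper's own definition is not given on this page, so the two SLO metrics used here (time to first token and time per output token), the thresholds, and the names `Request` and `goodput` are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Request:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # mean time per output token, in seconds

def goodput(requests, window_s, ttft_slo_s, tpot_slo_s):
    """Requests per second that met both latency SLOs within the window."""
    ok = sum(
        1 for r in requests
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return ok / window_s

# Hypothetical 2-second trace: the second request misses the TTFT SLO,
# the third misses the TPOT SLO.
trace = [
    Request(0.20, 0.04),
    Request(0.90, 0.03),
    Request(0.30, 0.12),
    Request(0.25, 0.05),
]
rate = goodput(trace, window_s=2.0, ttft_slo_s=0.5, tpot_slo_s=0.1)
throughput = len(trace) / 2.0
```

Under this definition the raw throughput of the toy trace is 2.0 requests per second while the goodput is 1.0, which is why a goodput figure is only interpretable once the SLO thresholds are stated.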
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with point-by-point responses and commit to revisions that strengthen the presentation without altering our core claims.
-
Referee: [Section 4.1] The intra-replica model sharing mechanism via unmerged inference and shadow adapters is described, but there is no explicit validation or equivalence check (e.g., output distribution similarity or downstream task accuracy) comparing it to standard merged-adapter serving after PEFT updates. This is load-bearing for the goodput claim, as any degradation in model quality could offset the reported efficiency gains.
Authors: We acknowledge the value of explicit empirical validation for this load-bearing mechanism. By design, unmerged inference with shadow adapters computes identical outputs to merged-adapter inference because the adapter weights are applied identically in the forward pass without requiring an explicit merge step; this equivalence follows directly from the linear nature of the adapter updates. To address the referee's concern directly, we will add a dedicated subsection in Section 4.1 (or an appendix) reporting output distribution similarity metrics such as KL divergence and cosine similarity on token logits, along with downstream task accuracy comparisons on standard benchmarks for the LLMs evaluated in the paper. revision: yes
-
Referee: [Section 5] The evaluation section reports up to 3x goodput improvements but provides insufficient details on the baselines compared against, statistical error bars, specific characteristics of the workload traces, and criteria for selecting the diverse LLMs. This undermines the verifiability of the central performance claim.
Authors: We agree that expanded details will improve verifiability and reproducibility. In the revised manuscript we will augment Section 5 with: explicit identification and configuration details for all baselines (including vLLM, TensorRT-LLM, and other SLO-aware systems); statistical error bars and standard deviations computed over at least five independent runs; precise characteristics of the workload traces (source, request-rate distributions, SLO definitions, and durations); and the selection criteria for the LLMs (parameter scale, architecture family, and public availability). These additions will be presented in both the main text and a new reproducibility table. revision: yes
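The similarity metrics promised in the first response (KL divergence and cosine similarity on token logits) could be computed along the following lines. This is a sketch under stated assumptions, not the authors' evaluation code: the function names are hypothetical, and the two logit arrays are random stand-ins for what merged-adapter and unmerged serving would produce on the same prompts.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(logits_p, logits_q):
    """Mean KL(P || Q) over positions, with P and Q the softmax of each row."""
    log_p, log_q = log_softmax(logits_p), log_softmax(logits_q)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

def mean_cosine(logits_p, logits_q):
    """Mean cosine similarity between corresponding logit rows."""
    num = (logits_p * logits_q).sum(axis=-1)
    den = np.linalg.norm(logits_p, axis=-1) * np.linalg.norm(logits_q, axis=-1)
    return float((num / den).mean())

# Stand-in data: 16 token positions over a 1000-entry vocabulary.
rng = np.random.default_rng(1)
reference = rng.standard_normal((16, 1000))                      # "merged" logits
candidate = reference + 1e-6 * rng.standard_normal((16, 1000))   # "unmerged" logits

kl = mean_kl(reference, candidate)
cos = mean_cosine(reference, candidate)
```

A KL value near zero and a cosine value near one would support the equivalence claim at the logit level; downstream task accuracy would still need to be measured separately, as the referee requests.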
Circularity Check
No circularity: claims rest on empirical evaluation, not self-referential definitions or fitted predictions
The paper presents CoLLM as a co-execution framework with two concrete mechanisms (intra-replica unmerged inference plus shadow adapters, and a two-timescale coordination algorithm) whose effectiveness is asserted via extensive evaluation on diverse LLMs and real-world traces, reporting up to 3x goodput gains. No equations, fitted parameters, or first-principles derivations appear in the provided text that reduce the reported outcomes to the inputs by construction. The central performance claim is therefore falsifiable against external baselines and does not rely on self-citation chains or renaming of known results. This is the normal, non-circular outcome for a systems paper whose load-bearing evidence is experimental rather than deductive.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Fine-tuning and inference can be co-executed on shared replicas with real-time parameter reuse via unmerged inference and shadow adapters without unacceptable overhead.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
intra-replica model sharing mechanism that enables real-time model parameter reuse via unmerged inference and shadow adapter strategies
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
two-timescale inter-replica coordination algorithm that adaptively balances fine-tuning and inference workloads
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.