
arxiv: 2604.16400 · v2 · pith:37OXJOHNnew · submitted 2026-03-31 · 💻 cs.DC · cs.AI · cs.LG

CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters

Pith reviewed 2026-05-21 10:29 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.LG
keywords LLM serving, fine-tuning, edge intelligence, GPU clusters, parameter sharing, SLO awareness, continuous adaptation, federated learning

The pith

CoLLM unifies fine-tuning and inference on shared edge GPU replicas so that model updates improve serving quality without extra deployments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CoLLM as a system that places federated parameter-efficient fine-tuning and inference on the same replicas instead of running them separately. An intra-replica sharing method keeps inference paths unmerged and uses shadow adapters so that fresh parameters reach inference immediately. A two-timescale coordination routine then moves work across replicas to protect short-term response targets while still harvesting long-term quality gains from fine-tuning. If the approach holds, edge clusters no longer need duplicate hardware pools or wait for separate retraining cycles before new adaptations appear in production.

Core claim

CoLLM shows that a unified co-execution framework on shared replicas can jointly raise long-term model quality and short-term inference efficiency. The framework combines unmerged inference and shadow adapters inside each replica with a two-timescale balancing algorithm between replicas, and it delivers up to three times the goodput of systems that isolate the two workloads.

What carries the argument

Intra-replica model sharing via unmerged inference and shadow adapter strategies, paired with a two-timescale inter-replica coordination algorithm that balances workloads.
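
To make the term concrete, the sketch below contrasts merged and unmerged forward passes for a single linear layer. It assumes LoRA-style low-rank adapters, which this summary does not confirm as CoLLM's exact adapter form; the layer sizes, the rank, and the names `merged_forward` and `unmerged_forward` are hypothetical, and the usual LoRA scaling factor is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 8, 2            # hypothetical layer sizes and adapter rank

W = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((rank, d_in))   # adapter down-projection
B = rng.standard_normal((d_out, rank))  # adapter up-projection
x = rng.standard_normal(d_in)           # one input activation


def merged_forward(W, A, B, x):
    """Fold the adapter into the base weight, then apply the result once."""
    return (W + B @ A) @ x


def unmerged_forward(W, A, B, x):
    """Keep base and adapter separate and add their outputs."""
    return W @ x + B @ (A @ x)


y_merged = merged_forward(W, A, B, x)
y_unmerged = unmerged_forward(W, A, B, x)

# The two paths agree up to floating-point rounding.
max_abs_diff = float(np.max(np.abs(y_merged - y_unmerged)))
print(max_abs_diff)

# In the unmerged path an adapter update leaves W untouched.
B_new = B + 0.01 * rng.standard_normal(B.shape)
y_updated = unmerged_forward(W, A, B_new, x)
```

The two paths are algebraically equal because (W + BA)x = Wx + B(Ax), so any difference comes from rounding order. The unmerged path pays for an extra pair of small matrix products per request, which is the kind of overhead the load-bearing premise below depends on.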

If this is right

  • Fewer total model copies are needed because one replica set supports both ongoing adaptation and live serving.
  • Inference quality improves as soon as fine-tuning updates arrive rather than after a separate deployment step.
  • Service-level objectives for latency and throughput can be maintained while model quality still advances over time.
  • Edge deployments of domain-specific LLMs become faster because training and serving share the same resource footprint.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same sharing pattern could collapse separate training and prediction pools in other paired workloads such as online learning for recommendation systems.
  • Operators might cut overall GPU hours by merging what are now treated as two independent resource classes.
  • Extensions could allow multiple simultaneous adaptation streams to run on one replica while preserving the same balancing logic.

Load-bearing premise

Unmerged inference and shadow adapters can reuse parameters in real time without unacceptable overhead or correctness issues on the shared replicas.

What would settle it

A side-by-side measurement of end-to-end latency and accuracy when fine-tuning runs on the same replica as inference versus when the two tasks use entirely separate replicas.
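
One way such a comparison could be tabulated is sketched below. The record fields, the `summarize` helper, and the 1.0 s latency target are all hypothetical, and the sample values are placeholders for illustration only, not measurements from the paper. The same function would be applied once to the shared-replica run and once to the separate-replica run.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    latency_s: float  # end-to-end latency of one request, in seconds
    correct: bool     # whether the response passed the task's quality check


def summarize(records, slo_s):
    """Return mean latency, SLO attainment rate, and accuracy for one run."""
    latencies = [r.latency_s for r in records]
    return {
        "mean_latency_s": mean(latencies),
        "slo_attainment": sum(lat <= slo_s for lat in latencies) / len(latencies),
        "accuracy": sum(r.correct for r in records) / len(records),
    }


# Placeholder records for illustration only; not data from the paper.
example_records = [
    RequestRecord(0.8, True),
    RequestRecord(1.2, True),
    RequestRecord(0.9, False),
    RequestRecord(1.1, True),
]

summary = summarize(example_records, slo_s=1.0)
print(summary)
```

Reporting latency and accuracy together matters here because a shared replica could meet its latency target while serving a stale or degraded adapter, or the reverse.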

Figures

Figures reproduced from arXiv: 2604.16400 by Na Yan, Shaoyuan Huang, Tiancheng Zhang, Wenyu Wang, Xiaofei Wang, Xiaokai Wang, Yansha Deng, Yunfeng Zhao.

Figure 1: GPU multiplexing versus model sharing.
Figure 2: Inference CE loss of different execution paradigms.
Figure 4: Merged/unmerged inference vs. shadow adapter-based …
Figure 5: Inference throughput and goodput of different systems across two LLMs and two domain-specific tasks.
Figure 6: System performance under different load scales.
Original abstract

As Large Language Models (LLMs) are increasingly adopted in edge intelligence to power domain-specific applications and personalized services, the quality and efficiency of the LLM post-training phase, including fine-tuning and inference, have become critical due to constrained resources. Although recent advances in federated parameter-efficient fine-tuning (FL PEFT) and low-latency inference have improved individual task performance, fine-tuning and inference are still handled as isolated workloads, which overlooks their interdependence and results in redundant deployments and delayed improvement in inference quality. To address these limitations, we introduce a new co-execution framework and instantiate it with CoLLM, a system that unifies FL PEFT and inference on shared edge replicas and model parameters. CoLLM addresses key challenges at both replica and cluster levels through: (1) an intra-replica model sharing mechanism that enables real-time model parameter reuse via unmerged inference and shadow adapter strategies; and (2) a two-timescale inter-replica coordination algorithm that adaptively balances fine-tuning and inference workloads to jointly optimize long-term model quality gains and short-term inference efficiency. Extensive evaluation across diverse LLMs and real-world traces shows that CoLLM consistently outperforms state-of-the-art LLM systems, achieving up to 3x higher goodput and demonstrating its effectiveness in enabling seamless LLM post-training for edge intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CoLLM, a system for continuous adaptation of LLMs on shared GPU clusters. It unifies federated parameter-efficient fine-tuning (FL PEFT) and inference workloads on shared replicas using an intra-replica model sharing mechanism based on unmerged inference and shadow adapters, combined with a two-timescale inter-replica coordination algorithm. The central claim is that this approach enables seamless post-training while achieving up to 3x higher goodput than state-of-the-art LLM serving systems, as demonstrated through evaluations on diverse LLMs and real-world traces.

Significance. If the performance claims hold and the sharing mechanism preserves inference quality, this work could have significant impact on edge intelligence by allowing joint optimization of model quality and serving efficiency without redundant resource allocation. The explicit handling of the interdependence between fine-tuning and inference is a strength, and the use of real-world traces adds to the practical relevance. However, the lack of detailed validation for the core mechanism limits the current assessment of its broader implications.

major comments (2)
  1. [Section 4.1] The intra-replica model sharing mechanism via unmerged inference and shadow adapters is described, but there is no explicit validation or equivalence check (e.g., output distribution similarity or downstream task accuracy) comparing it to standard merged-adapter serving after PEFT updates. This is load-bearing for the goodput claim, as any degradation in model quality could offset the reported efficiency gains.
  2. [Section 5] The evaluation section reports up to 3x goodput improvements but provides insufficient details on the baselines compared against, statistical error bars, specific characteristics of the workload traces, and criteria for selecting the diverse LLMs. This undermines the verifiability of the central performance claim.
minor comments (2)
  1. [Abstract] The abstract mentions 'goodput' without a brief definition; consider adding one for clarity to readers unfamiliar with the term in this context.
  2. [Throughout] Define acronyms such as FL PEFT at first use in the main text and use them consistently afterwards.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with point-by-point responses and commit to revisions that strengthen the presentation without altering our core claims.

Point-by-point responses
  1. Referee: [Section 4.1] The intra-replica model sharing mechanism via unmerged inference and shadow adapters is described, but there is no explicit validation or equivalence check (e.g., output distribution similarity or downstream task accuracy) comparing it to standard merged-adapter serving after PEFT updates. This is load-bearing for the goodput claim, as any degradation in model quality could offset the reported efficiency gains.

    Authors: We acknowledge the value of explicit empirical validation for this load-bearing mechanism. By design, unmerged inference with shadow adapters computes identical outputs to merged-adapter inference because the adapter weights are applied identically in the forward pass without requiring an explicit merge step; this equivalence follows directly from the linear nature of the adapter updates. To address the referee's concern directly, we will add a dedicated subsection in Section 4.1 (or an appendix) reporting output distribution similarity metrics such as KL divergence and cosine similarity on token logits, along with downstream task accuracy comparisons on standard benchmarks for the LLMs evaluated in the paper. revision: yes

  2. Referee: [Section 5] The evaluation section reports up to 3x goodput improvements but provides insufficient details on the baselines compared against, statistical error bars, specific characteristics of the workload traces, and criteria for selecting the diverse LLMs. This undermines the verifiability of the central performance claim.

    Authors: We agree that expanded details will improve verifiability and reproducibility. In the revised manuscript we will augment Section 5 with: explicit identification and configuration details for all baselines (including vLLM, TensorRT-LLM, and other SLO-aware systems); statistical error bars and standard deviations computed over at least five independent runs; precise characteristics of the workload traces (source, request-rate distributions, SLO definitions, and durations); and the selection criteria for the LLMs (parameter scale, architecture family, and public availability). These additions will be presented in both the main text and a new reproducibility table. revision: yes
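
The similarity metrics named in the first response (KL divergence and cosine similarity on token logits) can be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: the function names, array shapes, and the synthetic logits are assumptions made for illustration, and the small perturbation merely stands in for rounding differences between two serving paths.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl_divergence(p_logits, q_logits, eps=1e-12):
    """Mean KL(P || Q) over token positions, computed from raw logits."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    per_position = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(per_position))


def cosine_similarity(a, b):
    """Mean cosine similarity between the logit vectors at each position."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float(np.mean(num / den))


rng = np.random.default_rng(0)
# Synthetic logits: 4 token positions over a 32-entry vocabulary (hypothetical sizes).
merged_logits = rng.standard_normal((4, 32))
# A tiny perturbation stands in for rounding differences between serving paths.
unmerged_logits = merged_logits + 1e-6 * rng.standard_normal((4, 32))

print(kl_divergence(merged_logits, unmerged_logits))
print(cosine_similarity(merged_logits, unmerged_logits))
```

A KL divergence near zero and a cosine similarity near one would be consistent with the equivalence argument, while the downstream accuracy comparison promised in the same response would still be needed to rule out task-level drift.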

Circularity Check

0 steps flagged

No circularity: claims rest on empirical evaluation, not self-referential definitions or fitted predictions

Full rationale

The paper presents CoLLM as a co-execution framework with two concrete mechanisms (intra-replica unmerged inference plus shadow adapters, and a two-timescale coordination algorithm) whose effectiveness is asserted via extensive evaluation on diverse LLMs and real-world traces, reporting up to 3x goodput gains. No equations, fitted parameters, or first-principles derivations appear in the provided text that reduce the reported outcomes to the inputs by construction. The central performance claim is therefore falsifiable against external baselines and does not rely on self-citation chains or renaming of known results. This is the normal, non-circular outcome for a systems paper whose load-bearing evidence is experimental rather than deductive.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the feasibility of unified execution without major interference and on the representativeness of the evaluation traces; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Fine-tuning and inference can be co-executed on shared replicas with real-time parameter reuse via unmerged inference and shadow adapters without unacceptable overhead.
    This premise underpins the intra-replica mechanism and is required for the claimed efficiency gains.



