SLO-aware scheduling for large language model inferences
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.DC 2
verdicts
UNVERDICTED 2
representative citing papers
HFX jointly designs scheduling and scaling for multi-SLO LLM serving, achieving up to 4.44x higher SLO attainment, 65.82% lower latency, and 49.81% lower cost than prior systems on multi-task workloads.
citing papers explorer
- TCM-Serve: Modality-aware Scheduling for Multimodal Large Language Model Inference
TCM-Serve applies modality-aware scheduling to reduce average TTFT by 54% and 78.5% for latency-critical requests in MLLM inference.
- HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling
HFX jointly designs scheduling and scaling for multi-SLO LLM serving, achieving up to 4.44x higher SLO attainment, 65.82% lower latency, and 49.81% lower cost than prior systems on multi-task workloads.