Llumnix: Dynamic scheduling for large language model serving

Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin · 2024

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

cs.DC · 2026-03-12 · unverdicted · novelty 7.0

This work delivers the first measurements of performance-energy trade-offs across four multi-request LLM workflow patterns on A100 GPUs using vLLM and Parrot.

Efficient Remote KV Cache Reuse with GPU-native Video Codec

cs.DC · 2026-02-10 · conditional · novelty 7.0

KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.

Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning

cs.DC · 2025-11-18 · unverdicted · novelty 6.0

Seer improves synchronous LLM RL rollout throughput by up to 2.04x and reduces long-tail latency by 72-94% via divided rollout, context-aware scheduling, and adaptive grouped speculative decoding based on prompt similarity observations.

GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources

cs.DC · 2026-05-16 · unverdicted · novelty 4.0

GoodServe proposes a predict-and-rectify routing system for agentic LLM inferences on heterogeneous GPUs that improves goodput by up to 27.4%.

citing papers explorer

Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows · cs.DC · 2026-03-12 · unverdicted · none · ref 67
Efficient Remote KV Cache Reuse with GPU-native Video Codec · cs.DC · 2026-02-10 · conditional · none · ref 60
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning · cs.DC · 2025-11-18 · unverdicted · none · ref 38
GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources · cs.DC · 2026-05-16 · unverdicted · none · ref 30
