FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang · 2023

1 Pith paper cites this work. Polarity classification is still being indexed.

Representative citing papers

cs.DC · 2026-03-27 · unverdicted · novelty 7.0

TCM-Serve applies modality-aware scheduling to reduce average TTFT by 54% and 78.5% for latency-critical requests in MLLM inference.

TCM-Serve: Modality-aware Scheduling for Multimodal Large Language Model Inference · cs.DC · 2026-03-27 · unverdicted · none · ref 34