Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

Zheng, H · 2026 · cs.LG · arXiv 2604.00499

2 Pith papers cite this work.

abstract

To schedule LLM inference, the shortest job first (SJF) principle is favorable because it prioritizes requests with short output lengths and thereby avoids head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a point estimate does not match the stochastic decoding process of LLM inference, where output length is uncertain by nature and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. Through an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling; it adjusts the expectation of a log-t distribution with its tail probabilities to account for the risk that a request generates long outputs. To evaluate our TIE scheduler, we compare it with three strong baselines, and the results show that TIE reduces per-token latency by 2.31× for online inference and improves throughput by 1.42× for offline data generation.
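
The contrast the abstract draws between ranking by a point estimate and ranking by a tail-aware score can be illustrated with a minimal Python sketch. The abstract does not give the TIE formula, so nothing below is the paper's definition: `risk_adjusted_length`, its `tail_fraction` and `weight` parameters, and the sample data are hypothetical, and the score is computed from raw empirical samples rather than a fitted log-t distribution.

```python
import heapq
from statistics import mean


def risk_adjusted_length(samples, tail_fraction=0.1, weight=0.5):
    """Hypothetical stand-in for a tail-aware score (NOT the paper's TIE).

    Returns the empirical mean plus `weight` times the gap between the
    mean of the longest `tail_fraction` of samples and the overall mean.
    """
    ordered = sorted(samples)
    k = max(1, int(len(ordered) * tail_fraction))  # size of the upper tail
    tail_mean = mean(ordered[-k:])
    overall = mean(ordered)
    return overall + weight * (tail_mean - overall)


def sjf_order(requests, score_fn):
    """Return request names in shortest-job-first order under `score_fn`.

    `requests` is a list of (name, samples) pairs, where `samples` are
    predicted output lengths in tokens. Ties are broken by arrival index.
    """
    heap = [(score_fn(samples), idx, name)
            for idx, (name, samples) in enumerate(requests)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


# Illustrative data (assumed): A is steady; B is usually short but
# occasionally produces a very long output.
requests = [
    ("A", [100] * 10),          # mean 100, no tail
    ("B", [20] * 9 + [700]),    # mean 88, heavy tail
]

by_mean = sjf_order(requests, mean)
by_risk = sjf_order(requests, risk_adjusted_length)
print(by_mean)
print(by_risk)
```

Under the plain mean, B is scheduled first because its average (88) is below A's (100). Under the tail-aware score, B's rare 700-token output pushes it behind A. This is the kind of reordering a risk-adjusted metric is meant to produce when a long-running request could block shorter ones.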

representative citing papers

Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

LLM output lengths conditioned on a prompt form heavy-tailed distributions, so robust estimation from multiple samples outperforms single-sample labels for prediction.

MiCU: End-to-End Smart Home Command Understanding with Large Language Model

cs.CL · 2026-05-31 · unverdicted · novelty 4.0

MiCU is a domain-adapted LLM for smart-home command understanding that reports 20% average accuracy gains over baselines and is deployed in the Xiaomi Home app.
