Batch speculative decoding done right
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.NI (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Delay-Adaptive Speculation Control for Low-Latency Edge-Cloud LLM Inference
Formulates speculation control in distributed LLM inference as optimal stopping, proves delay-monotone thresholds, gives UCB-SpecStop with regret bounds, and reports up to 22% latency reduction on a Jetson-RTX testbed.