Gonzalez and Hao Zhang and Ion Stoica , title =
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
- ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU
A rule-based controller selects among FP16, quantized, speculative, and hybrid modes for single-GPU LLM inference, delivering 2.1x latency speedup and 51.7% lower energy per token with near-baseline accuracy on Llama-3.1-8B.