ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

Ali Alshehhi; Aman Sunesh; Hivansh Dhakne

arxiv: 2605.23057 · v1 · pith:4IK2MKEFnew · submitted 2026-05-21 · 💻 cs.LG · cs.CL · cs.PF


Pith reviewed 2026-05-25 05:24 UTC · model grok-4.3

classification 💻 cs.LG cs.CL cs.PF

keywords LLM inference, mode switching, quantization, speculative decoding, energy efficiency, latency optimization, single GPU, request routing


The pith

A lightweight controller routes each LLM request to an appropriate inference mode using workload features, delivering 2.1x latency speedup and 52% lower energy per token on one GPU while keeping accuracy near FP16.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a rule-based controller can improve single-GPU LLM inference by selecting among existing modes such as FP16, quantization, speculative decoding, and hybrids at each request boundary. It does so with cheap workload-level features instead of a single static configuration. A sympathetic reader would care because static modes leave large efficiency tradeoffs unexploited and retraining or architecture changes are costly. On Llama-3.1-8B the controller yields the reported speed and energy gains with negligible accuracy change. Learned routers add overhead and select constraint-violating modes more often, so the rule-based version is preferred.

Core claim

ModeSwitch-LLM is a lightweight request-boundary controller that selects among FP16, quantized modes, speculative decoding, and hybrid modes such as GPTQ plus prefix caching and INT8 plus continuous batching using cheap workload-level features. On deployment-style synthetic workloads for Meta-Llama-3.1-8B-Instruct on a single NVIDIA A100 GPU, the online controller achieves a 2.10x mean latency speedup over FP16 and a 0.48x mean energy ratio, corresponding to 51.7% lower energy per token. On automatic benchmarks used as a quality gate, accuracy remains close to FP16 with a mean delta of +0.17 percentage points. Learned routers do not clearly outperform the rule-based controller because they add routing overhead and more often select modes that violate quality, energy, or memory constraints.
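
The two energy figures are tied together by simple arithmetic. As a worked check, assuming the percentage is computed as one minus the energy ratio, and writing $r$ for the mean energy ratio (a symbol introduced here for illustration, not taken from the paper):

```latex
\[
  \text{reduction} = 1 - r, \qquad r = 0.48 \;\Rightarrow\; 1 - 0.48 = 0.52 .
\]
\[
  \text{reported reduction} = 0.517 \;\Rightarrow\; r = 1 - 0.517 = 0.483 \approx 0.48 .
\]
```

Under that assumption, the 0.48x ratio and the 51.7% reduction agree once the ratio is rounded to two decimals, and the 52% quoted in the headline summary is the same quantity rounded to a whole percent.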

What carries the argument

The lightweight request-boundary controller that routes each request to a fixed inference mode using workload-level features.
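
A minimal sketch of what a request-boundary router of this kind might look like. The abstract names the modes but not the features or the rules, so the feature names (`prompt_tokens`, `expected_output_tokens`, `shared_prefix_tokens`, `queue_depth`), every threshold, and the rule order below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FP16 = "fp16"
    GPTQ_PREFIX_CACHE = "gptq+prefix_caching"
    INT8_CONT_BATCH = "int8+continuous_batching"
    SPECULATIVE = "speculative_decoding"


@dataclass(frozen=True)
class RequestFeatures:
    # Hypothetical workload-level features; the source does not list the real ones.
    prompt_tokens: int
    expected_output_tokens: int
    shared_prefix_tokens: int
    queue_depth: int


def select_mode(f: RequestFeatures) -> Mode:
    """Pick one fixed mode for a whole request (decided once, at the request boundary)."""
    # Illustrative thresholds only, not values from the paper.
    if f.prompt_tokens > 0 and f.shared_prefix_tokens / f.prompt_tokens >= 0.5:
        return Mode.GPTQ_PREFIX_CACHE   # large reusable prefix
    if f.queue_depth >= 8:
        return Mode.INT8_CONT_BATCH     # many concurrent requests
    if f.expected_output_tokens >= 256:
        return Mode.SPECULATIVE         # long, decode-heavy generation
    return Mode.FP16                    # default when no rule fires
```

Because the rules are a short chain of comparisons, routing cost is a handful of branches per request, which is the kind of overhead advantage the summary attributes to the rule-based controller over learned routers.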

Load-bearing premise

Cheap workload-level features suffice to select modes that reliably satisfy quality, energy, and memory constraints without post-selection violation.

What would settle it

A workload where the controller's selected mode repeatedly violates accuracy, energy, or memory limits even though the observed features match those used in training the selection rules.

Figures

Figures reproduced from arXiv: 2605.23057 by Ali Alshehhi, Aman Sunesh, Hivansh Dhakne.

**Figure 2.** Online controller latency speedup and energy ratio across workload families.

**Figure 3.** Controller mode-selection behavior. Panel (a) shows which mode is selected for each ...

**Figure 4.** Learned-controller confusion matrices for the static feature set. The learned routers capture ...

**Figure 5.** Learned-router comparison. The rule controller is a strong practical baseline because it ...

Original abstract

ModeSwitch-LLM is a lightweight request-boundary controller for improving single-GPU large language model inference efficiency by routing each request to an appropriate fixed inference mode. Instead of relying on one static serving configuration, the system selects among FP16, quantized modes, speculative decoding, and hybrid modes such as GPTQ plus prefix caching and INT8 plus continuous batching using cheap workload-level features. We evaluate ModeSwitch-LLM on Meta-Llama-3.1-8B-Instruct served on a single NVIDIA A100 GPU. On deployment-style synthetic workloads, the online controller achieves a 2.10x mean latency speedup over FP16 and a 0.48x mean energy ratio, corresponding to 51.7% lower energy per token. On automatic benchmarks used as a quality gate, accuracy remains close to FP16 with a mean delta of +0.17 percentage points. We also evaluate lightweight learned routers, but find that they do not clearly outperform the rule-based controller because they add routing overhead and more often select modes that violate quality, energy, or memory constraints. These results show that simple request-aware routing can recover substantial efficiency from existing inference modes without retraining the model or changing its architecture.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A lightweight rule-based controller for switching LLM inference modes on one GPU shows concrete speedups on synthetic workloads but leaves the constraint-satisfaction evidence thin.


The paper's core contribution is ModeSwitch-LLM, a request-boundary controller that uses cheap workload features to pick among FP16, quantized, speculative, and hybrid modes for a single-GPU LLM. On Llama-3.1-8B-Instruct with an A100, it reports a 2.10x mean latency speedup and a 0.48x energy ratio versus static FP16 on synthetic workloads, plus near-FP16 accuracy on separate benchmarks. The rule-based design also comes out ahead of lightweight learned routers because the latter add overhead and select violating modes more often. That comparison is the most useful part of the work; it highlights a practical trade-off rather than claiming a new algorithm. The evaluation stays grounded in deployment-style workloads and existing modes without retraining the model.

The main soft spot is that the abstract gives no violation rates, per-mode accuracy deltas, or post-selection checks for the rule-based controller on the synthetic workloads themselves. Quality is only reported as a mean delta on the automatic benchmarks used as a gate, so the link between the performance numbers and constraint satisfaction is not fully shown. No error bars or full protocol details appear either.

This is a systems paper aimed at practitioners who deploy LLMs on single GPUs and want incremental efficiency without architecture changes. It has enough concrete claims and a clear comparison to merit peer review in a systems venue, though the full manuscript will need to close the evidence gap on how reliably the controller stays within quality, energy, and memory bounds.

Referee Report

2 major / 2 minor

Summary. The paper introduces ModeSwitch-LLM, a lightweight request-boundary controller that routes each LLM inference request to one of several fixed modes (FP16, quantized, speculative decoding, or hybrids such as GPTQ plus prefix caching) on a single GPU using cheap workload-level features. On Meta-Llama-3.1-8B-Instruct, it reports a 2.10x mean latency speedup and 0.48x mean energy ratio (51.7% lower energy per token) versus FP16 on deployment-style synthetic workloads, with accuracy remaining close to FP16 (mean +0.17 pp delta) on separate automatic benchmarks used as a quality gate. Learned routers are evaluated but found inferior due to added overhead and more frequent selection of modes violating quality/energy/memory constraints.

Significance. If verified, the result shows that lightweight rule-based routing across existing inference modes can deliver substantial single-GPU efficiency gains without model retraining or architectural modification. A clear strength is the direct comparison demonstrating that the rule-based controller outperforms learned routers in this regime, underscoring the practical value of low-overhead heuristics for mode selection under real constraints.

major comments (2)

[Evaluation on synthetic workloads] The central performance claims (2.10x latency, 0.48x energy) on synthetic workloads rest on the rule-based controller reliably selecting modes that satisfy quality, energy, and memory constraints. No violation rates, per-mode accuracy deltas, or post-selection verification results are reported for the rule-based controller on these workloads (in contrast to the learned routers, which are stated to select violating modes more often). This gap is load-bearing for the constraint-satisfaction premise.
[Quality evaluation] Accuracy is reported only as a mean delta of +0.17 pp on separate 'automatic benchmarks used as a quality gate.' No corresponding quality or constraint-satisfaction metrics are provided for the synthetic deployment-style workloads on which the latency and energy results are measured, creating a disconnect between the performance evaluation distribution and the evidence for valid mode selection.
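
The violation rate that both major comments ask for could be computed from measured per-request outcomes along the following lines. This is a sketch under stated assumptions: the `Outcome` and `Limits` types, their field names, and all default thresholds are hypothetical, since the source does not state the actual constraint values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    # Measured results for one routed request, relative to an FP16 baseline.
    quality_delta_pp: float  # accuracy change in percentage points
    energy_ratio: float      # energy per token / FP16 energy per token
    peak_memory_gb: float


@dataclass(frozen=True)
class Limits:
    # Hypothetical thresholds; the paper's real limits are not given in the source.
    min_quality_delta_pp: float = -1.0
    max_energy_ratio: float = 1.0
    max_memory_gb: float = 40.0


def violates(o: Outcome, lim: Limits) -> bool:
    """True if the measured outcome breaks any one of the three limits."""
    return (
        o.quality_delta_pp < lim.min_quality_delta_pp
        or o.energy_ratio > lim.max_energy_ratio
        or o.peak_memory_gb > lim.max_memory_gb
    )


def violation_rate(outcomes: list[Outcome], lim: Limits) -> float:
    """Fraction of routed requests whose measured outcome breaks a limit."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(violates(o, lim) for o in outcomes) / len(outcomes)
```

Reporting this number for the rule-based controller on the synthetic workloads, next to the same number for the learned routers, would put the two on the same footing.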

minor comments (2)

The manuscript provides no error bars, standard deviations, or details on the number of runs or workload repetitions for the reported mean latency and energy figures.
Full details on the synthetic workload generation, dataset characteristics, and complete evaluation protocol are absent, limiting independent verification of the quantitative claims.
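
For the first minor comment, a sketch of how a mean could be reported with its spread and run count. The per-run values below are made up purely to show the reporting format and are not results from the paper; the interval uses a normal approximation, which is crude for a small number of runs (a Student-t multiplier would be more appropriate there).

```python
import math
import statistics


def mean_with_ci(samples: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Return (mean, sample std, half-width of a normal-approximation 95% interval)."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)  # sample standard deviation (n - 1 denominator)
    half_width = z * std / math.sqrt(len(samples))
    return mean, std, half_width


# Made-up per-run latency speedups, for illustration only.
runs = [2.0, 2.2, 2.1, 2.1]
mean, std, hw = mean_with_ci(runs)
print(f"speedup {mean:.2f}x +/- {hw:.2f} (std {std:.2f}, n={len(runs)})")
```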

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important aspects of our evaluation methodology, and we address each point below with plans for revision.


Referee: [Evaluation on synthetic workloads] The central performance claims (2.10x latency, 0.48x energy) on synthetic workloads rest on the rule-based controller reliably selecting modes that satisfy quality, energy, and memory constraints. No violation rates, per-mode accuracy deltas, or post-selection verification results are reported for the rule-based controller on these workloads (in contrast to the learned routers, which are stated to select violating modes more often). This gap is load-bearing for the constraint-satisfaction premise.

Authors: The rule-based controller uses deterministic rules explicitly derived to enforce quality, energy, and memory constraints based on workload features (e.g., selecting a mode only when predicted metrics satisfy thresholds). This ensures compliance by construction, unlike the learned routers we evaluated and found to violate constraints more often. We did not report explicit rates because the design precludes violations. To strengthen the presentation, we will add a dedicated subsection with constraint-satisfaction verification results (expected to show full compliance) and any relevant per-mode details for the synthetic workloads.
Revision: yes
Referee: [Quality evaluation] Accuracy is reported only as a mean delta of +0.17 pp on separate 'automatic benchmarks used as a quality gate.' No corresponding quality or constraint-satisfaction metrics are provided for the synthetic deployment-style workloads on which the latency and energy results are measured, creating a disconnect between the performance evaluation distribution and the evidence for valid mode selection.

Authors: The automatic benchmarks function as an independent quality gate to pre-validate that selected modes maintain acceptable accuracy before use in deployment scenarios. The synthetic workloads focus on latency and energy under realistic serving conditions, relying on this prior validation. We agree that explicitly bridging the two would improve clarity and will revise the manuscript to include constraint-satisfaction metrics (derived from the rule logic) applied to the synthetic workloads, along with any feasible quality sampling analysis.
Revision: yes
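
The first response describes selecting a mode only when predicted metrics satisfy thresholds. One possible reading of that is a feasibility filter with FP16 as the fallback, sketched below. The `Prediction` type, its fields, and the `pick_feasible` helper are hypothetical names introduced for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    # Hypothetical per-mode predictions for one request.
    mode: str
    latency_ms: float
    quality_delta_pp: float
    energy_ratio: float
    memory_gb: float


def pick_feasible(
    candidates: list[Prediction],
    min_quality_delta_pp: float,
    max_energy_ratio: float,
    max_memory_gb: float,
    fallback: str = "fp16",
) -> str:
    """Return the lowest-latency mode whose predictions meet every threshold."""
    feasible = [
        p for p in candidates
        if p.quality_delta_pp >= min_quality_delta_pp
        and p.energy_ratio <= max_energy_ratio
        and p.memory_gb <= max_memory_gb
    ]
    if not feasible:
        return fallback  # no candidate passes, so fall back to the baseline mode
    return min(feasible, key=lambda p: p.latency_ms).mode
```

A filter like this guarantees compliance only with respect to the predicted metrics. Whether measured outcomes also stay within bounds is an empirical question, which is why the verification results promised in the rebuttal matter.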

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external workloads with no derivations or self-referential fits


The paper reports measured speedups, energy ratios, and accuracy deltas from running a rule-based controller on synthetic workloads and separate automatic benchmarks. No equations, fitted parameters, or derivations are present that reduce results to inputs by construction. Comparisons to learned routers are also direct empirical observations. The evaluation is self-contained against external benchmarks and does not rely on self-citation chains or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the controller relies on unspecified workload features and mode constraints whose details are absent.



