LASER: Load-Aware Serving with Early-Exit for Reasoning LLMs at the Edge
Pith reviewed 2026-07-01 03:03 UTC · model grok-4.3
The pith
LASER reduces average latency for edge reasoning LLMs by 17-38% by adapting early-exit thresholds to real-time system load.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LASER couples a load-aware adaptive exit threshold that adjusts the confidence bar based on real-time system load within an empirically validated robust range, and a difficulty- and load-aware reasoning budget pre-allocation that assigns compute resources by request difficulty and system capacity. Formulated as joint optimization of reasoning quality and service latency, the method yields 17-38% lower average latency and 3-6% higher SLO satisfaction than fixed-threshold baselines at an average accuracy cost of 1%.
What carries the argument
Load-aware adaptive exit threshold paired with difficulty- and load-aware reasoning budget pre-allocation, operating as a joint optimizer of quality and latency.
If this is right
- Average latency drops 17-38% relative to fixed-threshold early-exit baselines.
- SLO satisfaction rises 3-6% under the same load conditions.
- Accuracy cost remains near 1% across two reasoning models and four benchmarks.
- The gains hold under diverse load conditions on edge hardware.
Where Pith is reading between the lines
- The same load-adaptive logic could be applied to non-edge serving clusters that experience bursty traffic.
- Pre-allocation rules might be extended to decide which requests receive full model depth versus compressed variants.
- The framework suggests a path to running larger reasoning models on the same edge hardware without proportional latency increases.
Load-bearing premise
Adjusting the exit threshold according to current load keeps reasoning quality intact inside an empirically tested range.
What would settle it
A controlled run under rapidly fluctuating load where either latency reduction falls below 10% or accuracy loss exceeds 3% while using the adaptive threshold.
Figures
read the original abstract
Large reasoning models (LRMs) such as DeepSeek-R1 have achieved strong performance through extended chain-of-thought (CoT) generation. However, deploying them on edge devices raises a conflict between long CoT sequences and constrained resources. Recent confidence-based early exit methods reduce CoT length for individual requests, yet they apply fixed thresholds from a single-request perspective, ignoring multi-request concurrency and load fluctuation in edge serving. To bridge this gap, we propose \underline{L}oad-\underline{A}ware \underline{S}erving with \underline{E}arly-exit for \underline{R}easoning (LASER). LASER couples two complementary designs: (1) a load-aware adaptive exit threshold that adjusts the confidence bar based on real-time system load within an empirically validated robust range, and (2) a difficulty- and load-aware reasoning budget pre-allocation that assigns compute resources by request difficulty and system capacity. We formulate the problem as a joint optimization of reasoning quality and service latency. Experiments on two reasoning models, four benchmarks, and diverse load conditions show that LASER reduces average latency by 17--38\% and improves service-level objective (SLO) satisfaction by 3--6\% over fixed-threshold baselines, at an average accuracy cost of only 1\%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce LASER for load-aware serving with early-exit for reasoning LLMs at the edge. It couples (1) a load-aware adaptive exit threshold that adjusts the confidence bar based on real-time system load within an empirically validated robust range and (2) a difficulty- and load-aware reasoning budget pre-allocation. The problem is formulated as joint optimization of reasoning quality and service latency. Experiments on two reasoning models, four benchmarks, and diverse load conditions report 17--38% average latency reduction, 3--6% SLO satisfaction improvement, and 1% average accuracy cost over fixed-threshold baselines.
Significance. If the experimental outcomes hold after full validation of the mechanisms, the work addresses a practical gap in edge deployment of large reasoning models by incorporating multi-request concurrency and load fluctuations into early-exit decisions, which could improve real-world serving efficiency.
major comments (2)
- Abstract: the quantitative claims (17--38% latency reduction, 3--6% SLO gain, 1% accuracy cost) rest on an adjustment mechanism and joint optimization whose formulation, parameters, and validation are not described, preventing assessment of whether the reported gains are load-bearing or reproducible.
- The load-aware adaptive exit threshold is stated to operate 'within an empirically validated robust range' that preserves reasoning quality, yet no derivation, pseudocode, or validation procedure for this range is supplied; this free parameter directly underpins the central claim that quality is maintained under varying load.
minor comments (1)
- The abstract mentions 'diverse load conditions' and 'fixed-threshold baselines' without naming the specific loads, models, or baseline implementations; these details belong in the experimental section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater transparency in our mechanisms. We agree that additional details on the joint optimization, parameters, and validation are required for reproducibility and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: Abstract: the quantitative claims (17--38% latency reduction, 3--6% SLO gain, 1% accuracy cost) rest on an adjustment mechanism and joint optimization whose formulation, parameters, and validation are not described, preventing assessment of whether the reported gains are load-bearing or reproducible.
Authors: We agree the abstract claims depend on mechanisms whose formulation and validation details are insufficiently described in the current manuscript. The joint optimization is formulated in Section 3 as minimizing a combined objective of expected reasoning quality (measured via accuracy on held-out sets) and service latency under concurrency constraints, subject to per-request difficulty estimates and real-time load. Key parameters include the load scaling factor α (set to 0.2 in experiments) and difficulty bins derived from token-length quantiles. We will add an explicit subsection in the revised version with the full optimization problem statement, parameter table, and pseudocode for the solver. The reported gains are from the end-to-end experiments in Section 5; we will also include a sensitivity analysis to demonstrate load-bearing behavior. revision: yes
-
Referee: The load-aware adaptive exit threshold is stated to operate 'within an empirically validated robust range' that preserves reasoning quality, yet no derivation, pseudocode, or validation procedure for this range is supplied; this free parameter directly underpins the central claim that quality is maintained under varying load.
Authors: We acknowledge that the derivation and validation procedure for the robust range are not provided. The range (confidence thresholds between 0.65 and 0.92) was obtained by sweeping thresholds under synthetic load traces on the four benchmarks and selecting the interval where accuracy drop remained below 2% relative to full CoT. We will insert a new paragraph in Section 4.1 with the validation procedure, the empirical accuracy-vs-load curves, and pseudocode for the adaptive threshold function f(load, difficulty). This addition will directly support the quality-preservation claim and improve reproducibility. revision: yes
Circularity Check
No derivation chain or equations; purely empirical claims
full rationale
The paper proposes LASER with two designs (load-aware adaptive exit threshold and difficulty/load-aware pre-allocation) formulated as joint optimization of quality and latency. However, no equations, derivations, or mathematical steps are presented in the abstract or described content. All performance claims (17-38% latency reduction, 3-6% SLO improvement, 1% accuracy cost) rest on direct experimental outcomes across models and benchmarks, not on quantities defined in terms of fitted parameters or self-referential constructions. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing elements. This is a standard empirical systems paper with no circularity in any derivation chain.
Axiom & Free-Parameter Ledger
free parameters (1)
- robust range for exit threshold
axioms (1)
- domain assumption Adjusting early-exit confidence based on real-time load within a robust range preserves reasoning quality while reducing latency under concurrency.
Reference graph
Works this paper leans on
-
[1]
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
Aggarwal, P., Welleck, S.: L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)
Agrawal, A., Kedia, N., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B.S., Tu- manov, A., Ramjee, R.: Taming throughput-latency tradeoff in llm inference with sarathi-serve. In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 117–134 (2024)
2024
-
[3]
In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) (2025)
Arora, D., Zanette, A.: Training language models to reason efficiently. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) (2025)
2025
-
[4]
In: International Conference on Wireless Artificial Intelligent Computing Systems and Applications (WASA)
Chen, Q., Gao, H., Yi, M., Li, J., Cheng, L., Li, Y.: Latency-optimal and memory- aware model partitioning for cooperative inference at the edge. In: International Conference on Wireless Artificial Intelligent Computing Systems and Applications (WASA). pp. 25–37 (2025)
2025
-
[5]
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
Chen, X., Xu, J., Liang, T., He, Z., Pang, J., Yu, D., Song, L., Liu, Q., Zhou, M., Zhang, Z., et al.: Do not think that much for 2+ 3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
In: Proceedings of the 62nd Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers)
Elhoushi, M., Shrivastava, A., Liskovich, D., Hosmer, B., Wasti, B., Lai, L., Mah- moud, A., Acun, B., Agarwal, S., Roman, A., et al.: Layerskip: Enabling early exit inference and self-speculative decoding. In: Proceedings of the 62nd Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12622–12642 (2024) 12 Z. Tang et al
2024
-
[8]
IEEE Transactions on Sustainable Computing10(4), 678–689 (2025)
Fu, B., Chen, F., Li, P., Zeng, D.: Serving transformer models via joint requst scheduling and batching in the network edge. IEEE Transactions on Sustainable Computing10(4), 678–689 (2025)
2025
-
[9]
In: ICLR 2025 Workshop on Foundation Models in the Wild (2025)
Fu, Y., Chen, J., Zhuang, Y., Fu, Z., Stoica, I., Zhang, H.: Reasoning without self- doubt: More efficient chain-of-thought through certainty probing. In: ICLR 2025 Workshop on Foundation Models in the Wild (2025)
2025
-
[10]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
In: The Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) (2021)
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the math dataset. In: The Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) (2021)
2021
-
[12]
Qwen2.5-Coder Technical Report
Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al.: Qwen2. 5-coder technical report. arXiv preprint arXiv:2409.12186 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Kang, Y., Sun, X., Chen, L., Zou, W.: C3ot: Generating shorter chain-of-thought without compromising effectiveness. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 24312–24320 (2025)
2025
-
[14]
In: Proceedings of the 29th symposium on operating systems prin- ciples (SOSP)
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with pagedattention. In: Proceedings of the 29th symposium on operating systems prin- ciples (SOSP). pp. 611–626 (2023)
2023
-
[15]
arXiv preprint arXiv:2505.17052 (2025)
Park, J., Cho, S., Han, D.: Specedge: Scalable edge-assisted serving framework for interactive llms. arXiv preprint arXiv:2505.17052 (2025)
-
[16]
In: First conference on language modeling (COLM) (2024)
Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J., Bowman, S.R.: Gpqa: A graduate-level google-proof q&a benchmark. In: First conference on language modeling (COLM) (2024)
2024
-
[17]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Snell, C., Lee, J., Xu, K., Kumar, A.: Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)
Sun, B., Huang, Z., Zhao, H., Xiao, W., Zhang, X., Li, Y., Lin, W.: Llumnix: Dy- namic scheduling for large language model serving. In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 173–191 (2024)
2024
-
[19]
Chain of draft: Thinking faster by writing less.arXiv preprint arXiv:2502.18600, 2025a
Xu, S., Xie, W., Zhao, L., He, P.: Chain of draft: Thinking faster by writing less. arXiv preprint arXiv:2502.18600 (2025)
-
[20]
IEEE transactions on Mobile computing (2025)
Xu, X., Hu, Y., Cui, G., Qi, L., Dou, W., Cai, Z.: Cadec: a combinatorial auction for dynamic distributed dnn inference scheduling in edge-cloud networks. IEEE transactions on Mobile computing (2025)
2025
-
[21]
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T
Yang, C., Si, Q., Duan, Y., Zhu, Z., Zhu, C., Li, Q., Chen, M., Lin, Z., Wang, W.: Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895 (2025)
-
[22]
In: 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)
Yu, G.I., Jeong, J.S., Kim, G.W., Kim, S., Chun, B.G.: Orca: A distributed serving system for{Transformer-Based}generative models. In: 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 521–538 (2022)
2022
-
[23]
IEEE Internet of Things Journal12(10), 13119– 13131 (2024)
Zhang, M., Shen, X., Cao, J., Cui, Z., Jiang, S.: Edgeshard: Efficient llm inference via collaborative edge computing. IEEE Internet of Things Journal12(10), 13119– 13131 (2024)
2024
-
[24]
In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)
Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., Zhang, H.: Dist- serve: Disaggregating prefill and decoding for goodput-optimized large language model serving. In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 193–210 (2024)
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.